This analysis uses global country-level data to build and compare regularized regression and spline-based models for predicting high infant mortality and life expectancy, with an emphasis on variable selection, nonlinear effects, and predictive performance.

About the data

The analysis is based on the LifeExpectancy.csv dataset, which contains repeated country-level observations from 2000 to 2015 for 179 countries. The dataset includes demographic, health, and socioeconomic indicators such as infant mortality, BMI, HIV incidence, GDP per capita, schooling, child thinness, measles incidence, region, and economic development status, with life expectancy as the primary continuous outcome. For different modeling tasks, the data were subset by year, transformed to create a binary infant-mortality outcome (infant deaths ≥30), and split into training, validation, and testing sets. These data were then used to evaluate Ridge and LASSO models for classification and generalized additive models with splines for nonlinear prediction of life expectancy.

knitr::opts_chunk$set(echo = TRUE)
library(splines)
library(caret) #confusionMatrix
library(ISLR2)
library(gam)
library(tidyverse)
library(glmnet)
library(boot)
library(vcd) # for mosaic plots
library(leaps)

RNGkind(kind = NULL, normal.kind = NULL, sample.kind = NULL)
## [1] "Mersenne-Twister" "Inversion"        "Rejection"
set.seed(2345, sample.kind="Rounding")
## Warning in set.seed(2345, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
sample(1:10,5)
## [1]  2 10  6  1  3
rm(list=ls())

Data Cleaning

df <- read.csv("LifeExpectancy.csv")

str(df)
## 'data.frame':    2864 obs. of  13 variables:
##  $ Country                  : chr  "Turkiye" "Spain" "India" "Guyana" ...
##  $ Region                   : chr  "Middle East" "European Union" "Asia" "South America" ...
##  $ Year                     : int  2015 2015 2007 2006 2012 2006 2015 2000 2001 2008 ...
##  $ Infant_deaths            : num  11.1 2.7 51.5 32.8 3.4 9.8 6.6 8.7 22 15.3 ...
##  $ Measles                  : int  65 94 35 74 89 86 97 99 87 92 ...
##  $ BMI                      : num  27.8 26 21.2 25.3 27 26.4 26.2 25.9 27.9 26.5 ...
##  $ Incidents_HIV            : num  0.08 0.09 0.13 0.79 0.08 0.16 0.08 0.08 0.13 0.43 ...
##  $ GDP_per_capita           : int  11006 25742 1076 4146 33995 9110 9313 8971 3708 2235 ...
##  $ Thinness_five_nine_years : num  4.8 0.5 28 5.5 1.1 1.9 2.3 2.3 3.9 3.1 ...
##  $ Schooling                : num  7.8 9.7 5 7.9 12.8 7.9 12 10.2 9.6 10.9 ...
##  $ Economy_status_Developed : int  0 1 0 0 1 0 0 1 0 0 ...
##  $ Economy_status_Developing: int  1 0 1 1 0 1 1 0 1 1 ...
##  $ Life_expectancy          : num  76.5 82.8 65.4 67 81.7 78.2 71.2 71.2 71.9 68.7 ...
head(df)
##      Country                        Region Year Infant_deaths Measles  BMI
## 1    Turkiye                   Middle East 2015          11.1      65 27.8
## 2      Spain                European Union 2015           2.7      94 26.0
## 3      India                          Asia 2007          51.5      35 21.2
## 4     Guyana                 South America 2006          32.8      74 25.3
## 5     Israel                   Middle East 2012           3.4      89 27.0
## 6 Costa Rica Central America and Caribbean 2006           9.8      86 26.4
##   Incidents_HIV GDP_per_capita Thinness_five_nine_years Schooling
## 1          0.08          11006                      4.8       7.8
## 2          0.09          25742                      0.5       9.7
## 3          0.13           1076                     28.0       5.0
## 4          0.79           4146                      5.5       7.9
## 5          0.08          33995                      1.1      12.8
## 6          0.16           9110                      1.9       7.9
##   Economy_status_Developed Economy_status_Developing Life_expectancy
## 1                        0                         1            76.5
## 2                        1                         0            82.8
## 3                        0                         1            65.4
## 4                        0                         1            67.0
## 5                        1                         0            81.7
## 6                        0                         1            78.2
dim(df)
## [1] 2864   13
# Check for missing values
sum(is.na(df))
## [1] 0
# Check the ranges of continuous variables
ranges <- sapply(df[, sapply(df, is.numeric)], 
                 function(x) c(min = min(x, na.rm = TRUE),
                               max = max(x, na.rm = TRUE)))
t(ranges)
##                               min       max
## Year                      2000.00   2015.00
## Infant_deaths                1.80    138.10
## Measles                     10.00     99.00
## BMI                         19.80     32.10
## Incidents_HIV                0.01     21.68
## GDP_per_capita             148.00 112418.00
## Thinness_five_nine_years     0.10     28.60
## Schooling                    1.10     14.10
## Economy_status_Developed     0.00      1.00
## Economy_status_Developing    0.00      1.00
## Life_expectancy             39.40     83.80

Comment: Ther are no missing values in the dataset. Based on the observed ranges, all variables appear to fall within realistic and commonly reported values. For example, Year spans 2000–2015, country-level BMI averages fall between 19.8 and 32.1, infant mortality ranges from 1.8 to 138.1 per 1,000 live births, GDP per capita ranges from 148 to 112,418 USD, and life expectancy spans 39.4–83.8 years. These values are all plausible for global country-level data, so none of the variables show abnormal or implausible ranges.

String Variables

# Convert Country, Region, Economy_status_Developed, and Economy_status_Developing as factor variables
df <- df %>% mutate(
  Country = as.factor(Country),
  Region = as.factor(Region),
  Economy_status_Developed = as.factor(Economy_status_Developed),
  Economy_status_Developing = as.factor(Economy_status_Developing)
)

# Check the factor variables
table(df$Country)
## 
##                    Afghanistan                        Albania 
##                             16                             16 
##                        Algeria                         Angola 
##                             16                             16 
##            Antigua and Barbuda                      Argentina 
##                             16                             16 
##                        Armenia                      Australia 
##                             16                             16 
##                        Austria                     Azerbaijan 
##                             16                             16 
##                   Bahamas, The                        Bahrain 
##                             16                             16 
##                     Bangladesh                       Barbados 
##                             16                             16 
##                        Belarus                        Belgium 
##                             16                             16 
##                         Belize                          Benin 
##                             16                             16 
##                         Bhutan                        Bolivia 
##                             16                             16 
##         Bosnia and Herzegovina                       Botswana 
##                             16                             16 
##                         Brazil              Brunei Darussalam 
##                             16                             16 
##                       Bulgaria                   Burkina Faso 
##                             16                             16 
##                        Burundi                     Cabo Verde 
##                             16                             16 
##                       Cambodia                       Cameroon 
##                             16                             16 
##                         Canada       Central African Republic 
##                             16                             16 
##                           Chad                          Chile 
##                             16                             16 
##                          China                       Colombia 
##                             16                             16 
##                        Comoros               Congo, Dem. Rep. 
##                             16                             16 
##                    Congo, Rep.                     Costa Rica 
##                             16                             16 
##                  Cote d'Ivoire                        Croatia 
##                             16                             16 
##                           Cuba                         Cyprus 
##                             16                             16 
##                        Czechia                        Denmark 
##                             16                             16 
##                       Djibouti             Dominican Republic 
##                             16                             16 
##                        Ecuador               Egypt, Arab Rep. 
##                             16                             16 
##                    El Salvador              Equatorial Guinea 
##                             16                             16 
##                        Eritrea                        Estonia 
##                             16                             16 
##                       Eswatini                       Ethiopia 
##                             16                             16 
##                           Fiji                        Finland 
##                             16                             16 
##                         France                          Gabon 
##                             16                             16 
##                    Gambia, The                        Georgia 
##                             16                             16 
##                        Germany                          Ghana 
##                             16                             16 
##                         Greece                        Grenada 
##                             16                             16 
##                      Guatemala                         Guinea 
##                             16                             16 
##                  Guinea-Bissau                         Guyana 
##                             16                             16 
##                          Haiti                       Honduras 
##                             16                             16 
##                        Hungary                        Iceland 
##                             16                             16 
##                          India                      Indonesia 
##                             16                             16 
##             Iran, Islamic Rep.                           Iraq 
##                             16                             16 
##                        Ireland                         Israel 
##                             16                             16 
##                          Italy                        Jamaica 
##                             16                             16 
##                          Japan                         Jordan 
##                             16                             16 
##                     Kazakhstan                          Kenya 
##                             16                             16 
##                       Kiribati                         Kuwait 
##                             16                             16 
##                Kyrgyz Republic                        Lao PDR 
##                             16                             16 
##                         Latvia                        Lebanon 
##                             16                             16 
##                        Lesotho                        Liberia 
##                             16                             16 
##                          Libya                      Lithuania 
##                             16                             16 
##                     Luxembourg                     Madagascar 
##                             16                             16 
##                         Malawi                       Malaysia 
##                             16                             16 
##                       Maldives                           Mali 
##                             16                             16 
##                          Malta                     Mauritania 
##                             16                             16 
##                      Mauritius                         Mexico 
##                             16                             16 
##          Micronesia, Fed. Sts.                        Moldova 
##                             16                             16 
##                       Mongolia                     Montenegro 
##                             16                             16 
##                        Morocco                     Mozambique 
##                             16                             16 
##                        Myanmar                        Namibia 
##                             16                             16 
##                          Nepal                    Netherlands 
##                             16                             16 
##                    New Zealand                      Nicaragua 
##                             16                             16 
##                          Niger                        Nigeria 
##                             16                             16 
##                North Macedonia                         Norway 
##                             16                             16 
##                           Oman                       Pakistan 
##                             16                             16 
##                         Panama               Papua New Guinea 
##                             16                             16 
##                       Paraguay                           Peru 
##                             16                             16 
##                    Philippines                         Poland 
##                             16                             16 
##                       Portugal                          Qatar 
##                             16                             16 
##                        Romania             Russian Federation 
##                             16                             16 
##                         Rwanda                          Samoa 
##                             16                             16 
##          Sao Tome and Principe                   Saudi Arabia 
##                             16                             16 
##                        Senegal                         Serbia 
##                             16                             16 
##                     Seychelles                   Sierra Leone 
##                             16                             16 
##                      Singapore                Slovak Republic 
##                             16                             16 
##                       Slovenia                Solomon Islands 
##                             16                             16 
##                        Somalia                   South Africa 
##                             16                             16 
##                          Spain                      Sri Lanka 
##                             16                             16 
##                      St. Lucia St. Vincent and the Grenadines 
##                             16                             16 
##                       Suriname                         Sweden 
##                             16                             16 
##                    Switzerland           Syrian Arab Republic 
##                             16                             16 
##                     Tajikistan                       Tanzania 
##                             16                             16 
##                       Thailand                    Timor-Leste 
##                             16                             16 
##                           Togo                          Tonga 
##                             16                             16 
##            Trinidad and Tobago                        Tunisia 
##                             16                             16 
##                        Turkiye                   Turkmenistan 
##                             16                             16 
##                         Uganda                        Ukraine 
##                             16                             16 
##           United Arab Emirates                 United Kingdom 
##                             16                             16 
##                  United States                        Uruguay 
##                             16                             16 
##                     Uzbekistan                        Vanuatu 
##                             16                             16 
##                  Venezuela, RB                        Vietnam 
##                             16                             16 
##                    Yemen, Rep.                         Zambia 
##                             16                             16 
##                       Zimbabwe 
##                             16
table(df$Region)
## 
##                        Africa                          Asia 
##                           816                           432 
## Central America and Caribbean                European Union 
##                           304                           432 
##                   Middle East                 North America 
##                           224                            48 
##                       Oceania                Rest of Europe 
##                           176                           240 
##                 South America 
##                           192
table(df$Economy_status_Developed)
## 
##    0    1 
## 2272  592
table(df$Economy_status_Developing)
## 
##    0    1 
##  592 2272

Comment: I’ve converted Country, Region, Economy_status_Developed, and Economy_status_Developing as factor variables. They were converted well based on the output.

Continuous Variables

summary(df)
##                 Country                               Region         Year     
##  Afghanistan        :  16   Africa                       :816   Min.   :2000  
##  Albania            :  16   Asia                         :432   1st Qu.:2004  
##  Algeria            :  16   European Union               :432   Median :2008  
##  Angola             :  16   Central America and Caribbean:304   Mean   :2008  
##  Antigua and Barbuda:  16   Rest of Europe               :240   3rd Qu.:2011  
##  Argentina          :  16   Middle East                  :224   Max.   :2015  
##  (Other)            :2768   (Other)                      :416                 
##  Infant_deaths       Measles           BMI        Incidents_HIV    
##  Min.   :  1.80   Min.   :10.00   Min.   :19.80   Min.   : 0.0100  
##  1st Qu.:  8.10   1st Qu.:64.00   1st Qu.:23.20   1st Qu.: 0.0800  
##  Median : 19.60   Median :83.00   Median :25.50   Median : 0.1500  
##  Mean   : 30.36   Mean   :77.34   Mean   :25.03   Mean   : 0.8943  
##  3rd Qu.: 47.35   3rd Qu.:93.00   3rd Qu.:26.40   3rd Qu.: 0.4600  
##  Max.   :138.10   Max.   :99.00   Max.   :32.10   Max.   :21.6800  
##                                                                    
##  GDP_per_capita   Thinness_five_nine_years   Schooling     
##  Min.   :   148   Min.   : 0.1             Min.   : 1.100  
##  1st Qu.:  1416   1st Qu.: 1.6             1st Qu.: 5.100  
##  Median :  4217   Median : 3.4             Median : 7.800  
##  Mean   : 11541   Mean   : 4.9             Mean   : 7.632  
##  3rd Qu.: 12557   3rd Qu.: 7.3             3rd Qu.:10.300  
##  Max.   :112418   Max.   :28.6             Max.   :14.100  
##                                                            
##  Economy_status_Developed Economy_status_Developing Life_expectancy
##  0:2272                   0: 592                    Min.   :39.40  
##  1: 592                   1:2272                    1st Qu.:62.70  
##                                                     Median :71.40  
##                                                     Mean   :68.86  
##                                                     3rd Qu.:75.40  
##                                                     Max.   :83.80  
## 
# histograms for numeric columns to visually assess distributions
df %>%
  select_if(is.numeric) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(title = "Distributions of Continuous Variables")

Comment: I checked the continuous variables using summary statistics and visualized their distributions with histograms. No variables appeared incorrectly coded as numeric indicators, and the observed ranges were consistent with expectations. Several variables showed strong right-skewness, but no irregular values suggesting data coding errors were identified.

EDA

summary(df)
##                 Country                               Region         Year     
##  Afghanistan        :  16   Africa                       :816   Min.   :2000  
##  Albania            :  16   Asia                         :432   1st Qu.:2004  
##  Algeria            :  16   European Union               :432   Median :2008  
##  Angola             :  16   Central America and Caribbean:304   Mean   :2008  
##  Antigua and Barbuda:  16   Rest of Europe               :240   3rd Qu.:2011  
##  Argentina          :  16   Middle East                  :224   Max.   :2015  
##  (Other)            :2768   (Other)                      :416                 
##  Infant_deaths       Measles           BMI        Incidents_HIV    
##  Min.   :  1.80   Min.   :10.00   Min.   :19.80   Min.   : 0.0100  
##  1st Qu.:  8.10   1st Qu.:64.00   1st Qu.:23.20   1st Qu.: 0.0800  
##  Median : 19.60   Median :83.00   Median :25.50   Median : 0.1500  
##  Mean   : 30.36   Mean   :77.34   Mean   :25.03   Mean   : 0.8943  
##  3rd Qu.: 47.35   3rd Qu.:93.00   3rd Qu.:26.40   3rd Qu.: 0.4600  
##  Max.   :138.10   Max.   :99.00   Max.   :32.10   Max.   :21.6800  
##                                                                    
##  GDP_per_capita   Thinness_five_nine_years   Schooling     
##  Min.   :   148   Min.   : 0.1             Min.   : 1.100  
##  1st Qu.:  1416   1st Qu.: 1.6             1st Qu.: 5.100  
##  Median :  4217   Median : 3.4             Median : 7.800  
##  Mean   : 11541   Mean   : 4.9             Mean   : 7.632  
##  3rd Qu.: 12557   3rd Qu.: 7.3             3rd Qu.:10.300  
##  Max.   :112418   Max.   :28.6             Max.   :14.100  
##                                                            
##  Economy_status_Developed Economy_status_Developing Life_expectancy
##  0:2272                   0: 592                    Min.   :39.40  
##  1: 592                   1:2272                    1st Qu.:62.70  
##                                                     Median :71.40  
##                                                     Mean   :68.86  
##                                                     3rd Qu.:75.40  
##                                                     Max.   :83.80  
## 
# Histogram of numeric variables
hist(df$Year)

hist(df$Infant_deaths)

hist(df$Measles)

hist(df$BMI)

hist(df$Incidents_HIV)

hist(df$GDP_per_capita)

hist(df$Thinness_five_nine_years)

hist(df$Schooling)

hist(df$Life_expectancy)

# List of numeric variables
num_vars <- c("Infant_deaths", "Measles", "BMI", "Incidents_HIV", "GDP_per_capita", "Thinness_five_nine_years", "Schooling", "Life_expectancy")

# Correlation among numeric variables
cor_matrix <- cor(df[num_vars], use = "pairwise.complete.obs")
round(cor_matrix[1:6, 1:6], 2)
##                          Infant_deaths Measles   BMI Incidents_HIV
## Infant_deaths                     1.00   -0.53 -0.66          0.35
## Measles                          -0.53    1.00  0.42         -0.15
## BMI                              -0.66    0.42  1.00         -0.16
## Incidents_HIV                     0.35   -0.15 -0.16          1.00
## GDP_per_capita                   -0.51    0.31  0.34         -0.17
## Thinness_five_nine_years          0.48   -0.37 -0.60          0.19
##                          GDP_per_capita Thinness_five_nine_years
## Infant_deaths                     -0.51                     0.48
## Measles                            0.31                    -0.37
## BMI                                0.34                    -0.60
## Incidents_HIV                     -0.17                     0.19
## GDP_per_capita                     1.00                    -0.38
## Thinness_five_nine_years          -0.38                     1.00
# Graphical summaries
par(mfrow = c(2, 3))
for (v in num_vars) {
  hist(df[[v]], main = v, col = "lightblue", border = "white")
}

# Pairwise scatterplot matrix
pairs(df[, num_vars[1:5]], main = "Pairwise scatterplot matrix")

# Summary of categorical variables
summary(df[, c("Country", "Region", "Economy_status_Developed", "Economy_status_Developing")])
##                 Country                               Region   
##  Afghanistan        :  16   Africa                       :816  
##  Albania            :  16   Asia                         :432  
##  Algeria            :  16   European Union               :432  
##  Angola             :  16   Central America and Caribbean:304  
##  Antigua and Barbuda:  16   Rest of Europe               :240  
##  Argentina          :  16   Middle East                  :224  
##  (Other)            :2768   (Other)                      :416  
##  Economy_status_Developed Economy_status_Developing
##  0:2272                   0: 592                   
##  1: 592                   1:2272                   
##                                                    
##                                                    
##                                                    
##                                                    
## 
cat_vars <- df[, sapply(df, is.character) | sapply(df, is.factor)]

lapply(cat_vars, function(x) {
  list(
    table = table(x),
    proportion = prop.table(table(x))
  )
})
## $Country
## $Country$table
## x
##                    Afghanistan                        Albania 
##                             16                             16 
##                        Algeria                         Angola 
##                             16                             16 
##            Antigua and Barbuda                      Argentina 
##                             16                             16 
##                        Armenia                      Australia 
##                             16                             16 
##                        Austria                     Azerbaijan 
##                             16                             16 
##                   Bahamas, The                        Bahrain 
##                             16                             16 
##                     Bangladesh                       Barbados 
##                             16                             16 
##                        Belarus                        Belgium 
##                             16                             16 
##                         Belize                          Benin 
##                             16                             16 
##                         Bhutan                        Bolivia 
##                             16                             16 
##         Bosnia and Herzegovina                       Botswana 
##                             16                             16 
##                         Brazil              Brunei Darussalam 
##                             16                             16 
##                       Bulgaria                   Burkina Faso 
##                             16                             16 
##                        Burundi                     Cabo Verde 
##                             16                             16 
##                       Cambodia                       Cameroon 
##                             16                             16 
##                         Canada       Central African Republic 
##                             16                             16 
##                           Chad                          Chile 
##                             16                             16 
##                          China                       Colombia 
##                             16                             16 
##                        Comoros               Congo, Dem. Rep. 
##                             16                             16 
##                    Congo, Rep.                     Costa Rica 
##                             16                             16 
##                  Cote d'Ivoire                        Croatia 
##                             16                             16 
##                           Cuba                         Cyprus 
##                             16                             16 
##                        Czechia                        Denmark 
##                             16                             16 
##                       Djibouti             Dominican Republic 
##                             16                             16 
##                        Ecuador               Egypt, Arab Rep. 
##                             16                             16 
##                    El Salvador              Equatorial Guinea 
##                             16                             16 
##                        Eritrea                        Estonia 
##                             16                             16 
##                       Eswatini                       Ethiopia 
##                             16                             16 
##                           Fiji                        Finland 
##                             16                             16 
##                         France                          Gabon 
##                             16                             16 
##                    Gambia, The                        Georgia 
##                             16                             16 
##                        Germany                          Ghana 
##                             16                             16 
##                         Greece                        Grenada 
##                             16                             16 
##                      Guatemala                         Guinea 
##                             16                             16 
##                  Guinea-Bissau                         Guyana 
##                             16                             16 
##                          Haiti                       Honduras 
##                             16                             16 
##                        Hungary                        Iceland 
##                             16                             16 
##                          India                      Indonesia 
##                             16                             16 
##             Iran, Islamic Rep.                           Iraq 
##                             16                             16 
##                        Ireland                         Israel 
##                             16                             16 
##                          Italy                        Jamaica 
##                             16                             16 
##                          Japan                         Jordan 
##                             16                             16 
##                     Kazakhstan                          Kenya 
##                             16                             16 
##                       Kiribati                         Kuwait 
##                             16                             16 
##                Kyrgyz Republic                        Lao PDR 
##                             16                             16 
##                         Latvia                        Lebanon 
##                             16                             16 
##                        Lesotho                        Liberia 
##                             16                             16 
##                          Libya                      Lithuania 
##                             16                             16 
##                     Luxembourg                     Madagascar 
##                             16                             16 
##                         Malawi                       Malaysia 
##                             16                             16 
##                       Maldives                           Mali 
##                             16                             16 
##                          Malta                     Mauritania 
##                             16                             16 
##                      Mauritius                         Mexico 
##                             16                             16 
##          Micronesia, Fed. Sts.                        Moldova 
##                             16                             16 
##                       Mongolia                     Montenegro 
##                             16                             16 
##                        Morocco                     Mozambique 
##                             16                             16 
##                        Myanmar                        Namibia 
##                             16                             16 
##                          Nepal                    Netherlands 
##                             16                             16 
##                    New Zealand                      Nicaragua 
##                             16                             16 
##                          Niger                        Nigeria 
##                             16                             16 
##                North Macedonia                         Norway 
##                             16                             16 
##                           Oman                       Pakistan 
##                             16                             16 
##                         Panama               Papua New Guinea 
##                             16                             16 
##                       Paraguay                           Peru 
##                             16                             16 
##                    Philippines                         Poland 
##                             16                             16 
##                       Portugal                          Qatar 
##                             16                             16 
##                        Romania             Russian Federation 
##                             16                             16 
##                         Rwanda                          Samoa 
##                             16                             16 
##          Sao Tome and Principe                   Saudi Arabia 
##                             16                             16 
##                        Senegal                         Serbia 
##                             16                             16 
##                     Seychelles                   Sierra Leone 
##                             16                             16 
##                      Singapore                Slovak Republic 
##                             16                             16 
##                       Slovenia                Solomon Islands 
##                             16                             16 
##                        Somalia                   South Africa 
##                             16                             16 
##                          Spain                      Sri Lanka 
##                             16                             16 
##                      St. Lucia St. Vincent and the Grenadines 
##                             16                             16 
##                       Suriname                         Sweden 
##                             16                             16 
##                    Switzerland           Syrian Arab Republic 
##                             16                             16 
##                     Tajikistan                       Tanzania 
##                             16                             16 
##                       Thailand                    Timor-Leste 
##                             16                             16 
##                           Togo                          Tonga 
##                             16                             16 
##            Trinidad and Tobago                        Tunisia 
##                             16                             16 
##                        Turkiye                   Turkmenistan 
##                             16                             16 
##                         Uganda                        Ukraine 
##                             16                             16 
##           United Arab Emirates                 United Kingdom 
##                             16                             16 
##                  United States                        Uruguay 
##                             16                             16 
##                     Uzbekistan                        Vanuatu 
##                             16                             16 
##                  Venezuela, RB                        Vietnam 
##                             16                             16 
##                    Yemen, Rep.                         Zambia 
##                             16                             16 
##                       Zimbabwe 
##                             16 
## 
## $Country$proportion
## x
##                    Afghanistan                        Albania 
##                    0.005586592                    0.005586592 
##                        Algeria                         Angola 
##                    0.005586592                    0.005586592 
##            Antigua and Barbuda                      Argentina 
##                    0.005586592                    0.005586592 
##                        Armenia                      Australia 
##                    0.005586592                    0.005586592 
##                        Austria                     Azerbaijan 
##                    0.005586592                    0.005586592 
##                   Bahamas, The                        Bahrain 
##                    0.005586592                    0.005586592 
##                     Bangladesh                       Barbados 
##                    0.005586592                    0.005586592 
##                        Belarus                        Belgium 
##                    0.005586592                    0.005586592 
##                         Belize                          Benin 
##                    0.005586592                    0.005586592 
##                         Bhutan                        Bolivia 
##                    0.005586592                    0.005586592 
##         Bosnia and Herzegovina                       Botswana 
##                    0.005586592                    0.005586592 
##                         Brazil              Brunei Darussalam 
##                    0.005586592                    0.005586592 
##                       Bulgaria                   Burkina Faso 
##                    0.005586592                    0.005586592 
##                        Burundi                     Cabo Verde 
##                    0.005586592                    0.005586592 
##                       Cambodia                       Cameroon 
##                    0.005586592                    0.005586592 
##                         Canada       Central African Republic 
##                    0.005586592                    0.005586592 
##                           Chad                          Chile 
##                    0.005586592                    0.005586592 
##                          China                       Colombia 
##                    0.005586592                    0.005586592 
##                        Comoros               Congo, Dem. Rep. 
##                    0.005586592                    0.005586592 
##                    Congo, Rep.                     Costa Rica 
##                    0.005586592                    0.005586592 
##                  Cote d'Ivoire                        Croatia 
##                    0.005586592                    0.005586592 
##                           Cuba                         Cyprus 
##                    0.005586592                    0.005586592 
##                        Czechia                        Denmark 
##                    0.005586592                    0.005586592 
##                       Djibouti             Dominican Republic 
##                    0.005586592                    0.005586592 
##                        Ecuador               Egypt, Arab Rep. 
##                    0.005586592                    0.005586592 
##                    El Salvador              Equatorial Guinea 
##                    0.005586592                    0.005586592 
##                        Eritrea                        Estonia 
##                    0.005586592                    0.005586592 
##                       Eswatini                       Ethiopia 
##                    0.005586592                    0.005586592 
##                           Fiji                        Finland 
##                    0.005586592                    0.005586592 
##                         France                          Gabon 
##                    0.005586592                    0.005586592 
##                    Gambia, The                        Georgia 
##                    0.005586592                    0.005586592 
##                        Germany                          Ghana 
##                    0.005586592                    0.005586592 
##                         Greece                        Grenada 
##                    0.005586592                    0.005586592 
##                      Guatemala                         Guinea 
##                    0.005586592                    0.005586592 
##                  Guinea-Bissau                         Guyana 
##                    0.005586592                    0.005586592 
##                          Haiti                       Honduras 
##                    0.005586592                    0.005586592 
##                        Hungary                        Iceland 
##                    0.005586592                    0.005586592 
##                          India                      Indonesia 
##                    0.005586592                    0.005586592 
##             Iran, Islamic Rep.                           Iraq 
##                    0.005586592                    0.005586592 
##                        Ireland                         Israel 
##                    0.005586592                    0.005586592 
##                          Italy                        Jamaica 
##                    0.005586592                    0.005586592 
##                          Japan                         Jordan 
##                    0.005586592                    0.005586592 
##                     Kazakhstan                          Kenya 
##                    0.005586592                    0.005586592 
##                       Kiribati                         Kuwait 
##                    0.005586592                    0.005586592 
##                Kyrgyz Republic                        Lao PDR 
##                    0.005586592                    0.005586592 
##                         Latvia                        Lebanon 
##                    0.005586592                    0.005586592 
##                        Lesotho                        Liberia 
##                    0.005586592                    0.005586592 
##                          Libya                      Lithuania 
##                    0.005586592                    0.005586592 
##                     Luxembourg                     Madagascar 
##                    0.005586592                    0.005586592 
##                         Malawi                       Malaysia 
##                    0.005586592                    0.005586592 
##                       Maldives                           Mali 
##                    0.005586592                    0.005586592 
##                          Malta                     Mauritania 
##                    0.005586592                    0.005586592 
##                      Mauritius                         Mexico 
##                    0.005586592                    0.005586592 
##          Micronesia, Fed. Sts.                        Moldova 
##                    0.005586592                    0.005586592 
##                       Mongolia                     Montenegro 
##                    0.005586592                    0.005586592 
##                        Morocco                     Mozambique 
##                    0.005586592                    0.005586592 
##                        Myanmar                        Namibia 
##                    0.005586592                    0.005586592 
##                          Nepal                    Netherlands 
##                    0.005586592                    0.005586592 
##                    New Zealand                      Nicaragua 
##                    0.005586592                    0.005586592 
##                          Niger                        Nigeria 
##                    0.005586592                    0.005586592 
##                North Macedonia                         Norway 
##                    0.005586592                    0.005586592 
##                           Oman                       Pakistan 
##                    0.005586592                    0.005586592 
##                         Panama               Papua New Guinea 
##                    0.005586592                    0.005586592 
##                       Paraguay                           Peru 
##                    0.005586592                    0.005586592 
##                    Philippines                         Poland 
##                    0.005586592                    0.005586592 
##                       Portugal                          Qatar 
##                    0.005586592                    0.005586592 
##                        Romania             Russian Federation 
##                    0.005586592                    0.005586592 
##                         Rwanda                          Samoa 
##                    0.005586592                    0.005586592 
##          Sao Tome and Principe                   Saudi Arabia 
##                    0.005586592                    0.005586592 
##                        Senegal                         Serbia 
##                    0.005586592                    0.005586592 
##                     Seychelles                   Sierra Leone 
##                    0.005586592                    0.005586592 
##                      Singapore                Slovak Republic 
##                    0.005586592                    0.005586592 
##                       Slovenia                Solomon Islands 
##                    0.005586592                    0.005586592 
##                        Somalia                   South Africa 
##                    0.005586592                    0.005586592 
##                          Spain                      Sri Lanka 
##                    0.005586592                    0.005586592 
##                      St. Lucia St. Vincent and the Grenadines 
##                    0.005586592                    0.005586592 
##                       Suriname                         Sweden 
##                    0.005586592                    0.005586592 
##                    Switzerland           Syrian Arab Republic 
##                    0.005586592                    0.005586592 
##                     Tajikistan                       Tanzania 
##                    0.005586592                    0.005586592 
##                       Thailand                    Timor-Leste 
##                    0.005586592                    0.005586592 
##                           Togo                          Tonga 
##                    0.005586592                    0.005586592 
##            Trinidad and Tobago                        Tunisia 
##                    0.005586592                    0.005586592 
##                        Turkiye                   Turkmenistan 
##                    0.005586592                    0.005586592 
##                         Uganda                        Ukraine 
##                    0.005586592                    0.005586592 
##           United Arab Emirates                 United Kingdom 
##                    0.005586592                    0.005586592 
##                  United States                        Uruguay 
##                    0.005586592                    0.005586592 
##                     Uzbekistan                        Vanuatu 
##                    0.005586592                    0.005586592 
##                  Venezuela, RB                        Vietnam 
##                    0.005586592                    0.005586592 
##                    Yemen, Rep.                         Zambia 
##                    0.005586592                    0.005586592 
##                       Zimbabwe 
##                    0.005586592 
## 
## 
## $Region
## $Region$table
## x
##                        Africa                          Asia 
##                           816                           432 
## Central America and Caribbean                European Union 
##                           304                           432 
##                   Middle East                 North America 
##                           224                            48 
##                       Oceania                Rest of Europe 
##                           176                           240 
##                 South America 
##                           192 
## 
## $Region$proportion
## x
##                        Africa                          Asia 
##                    0.28491620                    0.15083799 
## Central America and Caribbean                European Union 
##                    0.10614525                    0.15083799 
##                   Middle East                 North America 
##                    0.07821229                    0.01675978 
##                       Oceania                Rest of Europe 
##                    0.06145251                    0.08379888 
##                 South America 
##                    0.06703911 
## 
## 
## $Economy_status_Developed
## $Economy_status_Developed$table
## x
##    0    1 
## 2272  592 
## 
## $Economy_status_Developed$proportion
## x
##         0         1 
## 0.7932961 0.2067039 
## 
## 
## $Economy_status_Developing
## $Economy_status_Developing$table
## x
##    0    1 
##  592 2272 
## 
## $Economy_status_Developing$proportion
## x
##         0         1 
## 0.2067039 0.7932961
ggplot(df, aes(x = Country)) +
  geom_bar() +
  coord_flip() +   # I tried to avoid label overlap, but the number of countries is so large that the plot still appears very crowded.
  theme_minimal() +
  labs(title = "Distribution of Countries",
       x = "Country",
       y = "Count")

ggplot(df, aes(x = Region)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

ggplot(df, aes(x = factor(Economy_status_Developed))) +
  geom_bar() +
  theme_minimal()

ggplot(df, aes(x = factor(Economy_status_Developing))) +
  geom_bar() +
  theme_minimal()

Year: Year ranges from 2000 to 2015 with each year represented fairly evenly across countries, producing a uniform distribution of observations over time.

Infant_deaths: Infant_deaths ranges from 1.8 to 138.1, with a median of 19.6 and a mean of 30.4. The strong right-skew reflects that while many countries report relatively low infant mortality, a smaller group experiences very high rates.

Measles: Measles shows a bimodal and right-skewed distribution, with many observations around 60–100 but also a long tail at lower values. Measles cases vary from 10 to 99, with a median of 83 and a mean of 77.3.

BMI: BMI ranges from 19.8 to 32.1, with a median of 25.5 and a mean of 25.0. BMI is approximately symmetric and centered around 24–27, indicating that most countries have population-average BMI in the overweight range. There are few extreme values and no apparent outliers.

Incidents_HIV: Incidents_HIV ranges from 0.01 to 21.68, but has a very low median (0.15) and mean (0.89), showing a highly right-skewed pattern dominated by near-zero values. A small number of countries experience disproportionately high HIV incidence.

GDP_per_capita: GDP_per_capita spans a very wide range from 148 to 112,418, with a median of 4,217 and a mean of 11,541. The severe right skew reflects large global economic disparities, with a few high-income countries driving the mean upward.

Thinness_five_nine_years: Thinness_five_nine_years ranges from 0.1 to 28.6, with a median of 3.4 and mean of 4.9. The distribution is right-skewed, indicating that most countries have low thinness rates (falling between 0 and 10%) while a smaller subset shows much higher prevalence.

Schooling: Schooling ranges from 1.1 to 14.1 years (slightly right-skewed), with a median of 7.8 and mean of 7.63. The values are broadly distributed, reflecting substantial variation in educational attainment across countries.

Life_expectancy: Life_expectancy ranges from 39.4 to 83.8 years, with a median of 71.4 and a mean of 68.9. The distribution is slightly left-skewed, with many countries concentrated in the 65–75 year range.

Country: The Country variable contains 179 distinct countries, and each country contributes repeated observations over time. For example, many countries appear 16 times in the dataset, leading to a wide spread of frequencies across levels.

Region: The Region variable categorizes countries into nine global areas. Africa accounts for the largest share with 816 records, followed by Asia and the European Union with 432 each, and Central America/Caribbean with 304 entries. These counts show that regional representation is uneven across the dataset.

Economy_status_Developed: This binary variable identifies whether an observation is from a developed economy. A total of 592 observations are marked as developed, while the remaining 2,272 are coded as non-developed, indicating that developed economies make up only a small portion of the data.

Economy_status_Developing: This indicator variable designates whether an observation belongs to a developing economy. Because it is defined as the complement of the “Developed” variable, the counts invert accordingly: 2,272 observations are coded as developing and 592 as non-developing. This aligns with the overall structure of the dataset, in which developing countries are more prevalent.

Model Regularization and Selection

Data Manipulation

# Create new variable indicating if Infant Deaths >= 30
df <- df %>%
  mutate(InfantDeaths_over30 = ifelse(Infant_deaths >= 30, 1, 0))

# Check result
table(df$InfantDeaths_over30)
## 
##    0    1 
## 1796 1068
# Restrict to only year 2010
df_2010 <- df %>%
  filter(Year == 2010)

# Check result
summary(df_2010$Year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2010    2010    2010    2010    2010    2010
# Create a subset excluding the specified variables
df_problem1 <- df_2010 %>%
  select(-Country, -Year, -Life_expectancy, -Infant_deaths)

# Check structure
str(df_problem1)
## 'data.frame':    179 obs. of  10 variables:
##  $ Region                   : Factor w/ 9 levels "Africa","Asia",..: 8 4 3 4 2 8 1 1 4 2 ...
##  $ Measles                  : int  83 93 91 94 93 97 64 64 96 85 ...
##  $ BMI                      : num  25.1 26.3 28.4 25.8 26.4 26.4 23 23.8 26.1 23.3 ...
##  $ Incidents_HIV            : num  0.06 0.03 0.37 0.08 0.1 0.04 2.41 5.85 0.02 0.16 ...
##  $ GDP_per_capita           : int  82027 37761 4768 48370 33437 72427 586 4147 20613 2235 ...
##  $ Thinness_five_nine_years : num  0.4 1.1 3.4 1.3 5.4 0.7 7.9 1.4 1.6 17.3 ...
##  $ Schooling                : num  13.3 13.8 10.5 12.3 8.8 12.7 2.6 6.2 12.1 2.3 ...
##  $ Economy_status_Developed : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 1 2 1 ...
##  $ Economy_status_Developing: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 2 2 1 2 ...
##  $ InfantDeaths_over30      : num  0 0 0 0 0 0 1 1 0 1 ...

Training Validation Testing Sets

# training+validation = 75% vs. testing = 25%
set.seed(123)
trainval <- createDataPartition(y=df_problem1$InfantDeaths_over30, p=0.75, list=FALSE)

dftrval <- df_problem1[trainval,]
dftest <- df_problem1[-trainval,]

# training = 50% vs. Validation = 25%
set.seed(123)
train <- createDataPartition(y=dftrval$InfantDeaths_over30, p=2/3, list=FALSE)
dftr <- dftrval[train,]
dfval <- dftrval[-train,]

# Verify the data
dim(df); dim(dftrval); dim(dftest); dim(dftr); dim(dfval)
## [1] 2864   14
## [1] 135  10
## [1] 44 10
## [1] 90 10
## [1] 45 10

EDA

EDA - Continuous

# Check outcome variable
table(df_problem1$InfantDeaths_over30)
## 
##   0   1 
## 117  62
prop.table(table(df_problem1$InfantDeaths_over30))
## 
##         0         1 
## 0.6536313 0.3463687
# Identify numeric variables
num_vars <- df_problem1 %>% select_if(is.numeric) %>% select(-InfantDeaths_over30)

# Numeric summary by group
df_problem1 %>%
  group_by(InfantDeaths_over30) %>%
  summarise(across(where(is.numeric), 
                   list(mean = mean, sd = sd, median = median), 
                   .names = "{.col}_{.fn}"))
## # A tibble: 2 × 19
##   InfantDeaths_over30 Measles_mean Measles_sd Measles_median BMI_mean BMI_sd
##                 <dbl>        <dbl>      <dbl>          <dbl>    <dbl>  <dbl>
## 1                   0         84.5       16.4             91     26.3   1.53
## 2                   1         65.0       17.2             64     23.2   1.71
## # ℹ 13 more variables: BMI_median <dbl>, Incidents_HIV_mean <dbl>,
## #   Incidents_HIV_sd <dbl>, Incidents_HIV_median <dbl>,
## #   GDP_per_capita_mean <dbl>, GDP_per_capita_sd <dbl>,
## #   GDP_per_capita_median <dbl>, Thinness_five_nine_years_mean <dbl>,
## #   Thinness_five_nine_years_sd <dbl>, Thinness_five_nine_years_median <dbl>,
## #   Schooling_mean <dbl>, Schooling_sd <dbl>, Schooling_median <dbl>
# Boxplot
num_var_names <- names(num_vars)

for (v in num_var_names) {
  print(
    ggplot(df_problem1, aes(x = as.factor(InfantDeaths_over30), y = .data[[v]], fill = as.factor(InfantDeaths_over30))) +
      geom_boxplot() +
      labs(x = "Infant Deaths ≥ 30", y = v, title = paste("Boxplot of", v, "by Outcome")) +
      theme_minimal() +
      scale_fill_manual(values = c("#56B4E9", "#E69F00"))
  )
}

Measles: Countries with infant deaths >=30 tend to have lower measles values (median 64 vs. 91), and the boxplot shows a downward shift in the distribution. This suggests that higher infant mortality is associated with lower measles reporting rates or differing surveillance patterns.

BMI: BMI levels are lower in the >=30 infant death group (median 23.2 vs. 26.1), and the two boxplots barely overlap. This indicates that countries with higher infant mortality generally have lower population BMI.

Incidents_HIV: HIV incidence is markedly higher in countries with infant deaths >=30 (median ~1.1 vs. ~0.02), and the distribution is heavily right-skewed in this group. This pattern suggests that HIV burden is greater in settings where infant mortality is high.

GDP_per_capita: GDP per capita is dramatically lower in the high-infant-death group (median around 1,300 vs. 8,200), and the boxplot shows little overlap between groups. This strong contrast suggests that poorer countries experience far higher infant mortality.

Thinness_five_nine_years: Child thinness is substantially higher in the >=30 group (median ~8.1 vs. ~2.6), with the boxplot showing a clear upward shift. This indicates that undernutrition among children is strongly associated with higher infant deaths.

Schooling: Average years of schooling are much lower among countries with >=30 infant deaths (median ~4.2 vs. ~9.7), and the distributions are distinctly separated. This implies that lower educational attainment is strongly correlated with higher infant mortality.

EDA - Categorical

# Identify categorical variables
cat_vars <- df_problem1 %>% select(where(is.factor))

# Frequency tables by outcome
for (v in names(cat_vars)) {
  print(v)
  print(table(df_problem1[[v]], df_problem1$InfantDeaths_over30))
  print(prop.table(table(df_problem1[[v]], df_problem1$InfantDeaths_over30), margin = 2))
}
## [1] "Region"
##                                
##                                  0  1
##   Africa                         8 43
##   Asia                          15 12
##   Central America and Caribbean 18  1
##   European Union                27  0
##   Middle East                   13  1
##   North America                  3  0
##   Oceania                        8  3
##   Rest of Europe                15  0
##   South America                 10  2
##                                
##                                          0          1
##   Africa                        0.06837607 0.69354839
##   Asia                          0.12820513 0.19354839
##   Central America and Caribbean 0.15384615 0.01612903
##   European Union                0.23076923 0.00000000
##   Middle East                   0.11111111 0.01612903
##   North America                 0.02564103 0.00000000
##   Oceania                       0.06837607 0.04838710
##   Rest of Europe                0.12820513 0.00000000
##   South America                 0.08547009 0.03225806
## [1] "Economy_status_Developed"
##    
##      0  1
##   0 80 62
##   1 37  0
##    
##             0         1
##   0 0.6837607 1.0000000
##   1 0.3162393 0.0000000
## [1] "Economy_status_Developing"
##    
##      0  1
##   0 37  0
##   1 80 62
##    
##             0         1
##   0 0.3162393 0.0000000
##   1 0.6837607 1.0000000

Region: Africa accounts for most observations with infant deaths >=30 (43 out of 62), while regions such as Europe, North America, and Rest of Europe contribute none. This indicates that high infant mortality is geographically concentrated, especially in Africa and parts of Asia.

Economy_status_Developed: All observations classified as developed economies fall in the <30 infant deaths group (37 vs. 0), showing that no developed country in the dataset experienced infant deaths >=30. This highlights a strong separation between developed status and high infant mortality.

Economy_status_Developing: The pattern reverses for developing economies: 62 of the 62 high–infant-death observations come from developing countries. This reinforces the relationship between lower economic development and elevated infant mortality.

Model Building

Ridge: Cross validation, lambda, graphics

# Train+validation -> x, y
x_trval <- model.matrix(InfantDeaths_over30 ~ ., data = dftrval)[, -1]
y_trval <- dftrval$InfantDeaths_over30  

set.seed(123)


# Lambda grid
grid <- c(10^seq(4, -2, length = 99), 0)
# Print
grid
##   [1] 1.000000e+04 8.685114e+03 7.543120e+03 6.551286e+03 5.689866e+03
##   [6] 4.941713e+03 4.291934e+03 3.727594e+03 3.237458e+03 2.811769e+03
##  [11] 2.442053e+03 2.120951e+03 1.842070e+03 1.599859e+03 1.389495e+03
##  [16] 1.206793e+03 1.048113e+03 9.102982e+02 7.906043e+02 6.866488e+02
##  [21] 5.963623e+02 5.179475e+02 4.498433e+02 3.906940e+02 3.393222e+02
##  [26] 2.947052e+02 2.559548e+02 2.222996e+02 1.930698e+02 1.676833e+02
##  [31] 1.456348e+02 1.264855e+02 1.098541e+02 9.540955e+01 8.286428e+01
##  [36] 7.196857e+01 6.250552e+01 5.428675e+01 4.714866e+01 4.094915e+01
##  [41] 3.556480e+01 3.088844e+01 2.682696e+01 2.329952e+01 2.023590e+01
##  [46] 1.757511e+01 1.526418e+01 1.325711e+01 1.151395e+01 1.000000e+01
##  [51] 8.685114e+00 7.543120e+00 6.551286e+00 5.689866e+00 4.941713e+00
##  [56] 4.291934e+00 3.727594e+00 3.237458e+00 2.811769e+00 2.442053e+00
##  [61] 2.120951e+00 1.842070e+00 1.599859e+00 1.389495e+00 1.206793e+00
##  [66] 1.048113e+00 9.102982e-01 7.906043e-01 6.866488e-01 5.963623e-01
##  [71] 5.179475e-01 4.498433e-01 3.906940e-01 3.393222e-01 2.947052e-01
##  [76] 2.559548e-01 2.222996e-01 1.930698e-01 1.676833e-01 1.456348e-01
##  [81] 1.264855e-01 1.098541e-01 9.540955e-02 8.286428e-02 7.196857e-02
##  [86] 6.250552e-02 5.428675e-02 4.714866e-02 4.094915e-02 3.556480e-02
##  [91] 3.088844e-02 2.682696e-02 2.329952e-02 2.023590e-02 1.757511e-02
##  [96] 1.526418e-02 1.325711e-02 1.151395e-02 1.000000e-02 0.000000e+00
# Cross-validation to find optimal lambda
ridge_cv <- cv.glmnet(x = x_trval, y = y_trval, family = "binomial", lambda = grid, alpha = 0)
names(ridge_cv)
##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "call"       "name"       "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
# Plots
plot(ridge_cv)
title("Ridge CV - binomial", line = 2.5)

# Two optimal lambdas
ridge_lambda_min  <- ridge_cv$lambda.min
ridge_lambda_min
## [1] 0.01526418
-log(ridge_lambda_min)
## [1] 4.182246
ridge_lambda_1se  <- ridge_cv$lambda.1se
ridge_lambda_1se
## [1] 0.2559548
-log(ridge_lambda_1se)
## [1] 1.362754
ridge_cv$cvm
##   [1] 1.3337705 1.3337528 1.3337325 1.3337091 1.3336821 1.3336511 1.3336154
##   [8] 1.3335742 1.3335269 1.3334723 1.3334096 1.3333373 1.3332541 1.3331583
##  [15] 1.3330481 1.3329212 1.3327752 1.3326071 1.3324136 1.3321910 1.3319348
##  [22] 1.3316401 1.3313009 1.3309108 1.3304621 1.3299461 1.3293528 1.3286707
##  [29] 1.3278868 1.3269861 1.3259516 1.3247637 1.3234003 1.3218363 1.3200493
##  [36] 1.3179987 1.3156493 1.3129619 1.3098910 1.3063857 1.3023897 1.2978407
##  [43] 1.2926709 1.2868065 1.2801682 1.2726719 1.2642296 1.2547508 1.2441444
##  [50] 1.2323216 1.2191981 1.2046989 1.1887614 1.1713407 1.1524135 1.1319825
##  [57] 1.1100799 1.0867701 1.0621507 1.0363516 1.0095333 0.9818822 0.9536062
##  [64] 0.9249274 0.8960759 0.8672824 0.8387743 0.8107708 0.7834646 0.7570228
##  [71] 0.7316029 0.7073369 0.6843284 0.6626562 0.6423745 0.6235148 0.6060878
##  [78] 0.5900852 0.5754822 0.5622400 0.5503081 0.5396260 0.5301266 0.5217374
##  [85] 0.5143838 0.5079898 0.5024860 0.4977902 0.4938350 0.4905799 0.4879432
##  [92] 0.4858592 0.4842958 0.4831613 0.4825218 0.4822892 0.4824458 0.4829522
##  [99] 0.4837880 0.9905219
ridge_cv$index
##     Lambda
## min     96
## 1se     76
ridge_cv$cvm[ridge_cv$index]
## [1] 0.4822892 0.6235148
coef(ridge_cv, c(ridge_lambda_min, ridge_lambda_1se))
## 17 x 2 sparse Matrix of class "dgCMatrix"
##                                      s=0.01526418  s=0.25595479
## (Intercept)                          1.355210e+01  5.911226e+00
## RegionAsia                           2.878540e-01  8.542672e-02
## RegionCentral America and Caribbean -2.423031e+00 -7.544565e-01
## RegionEuropean Union                -5.856813e-01 -3.168081e-01
## RegionMiddle East                   -1.588611e+00 -5.932371e-01
## RegionNorth America                 -1.426218e-01 -1.770785e-01
## RegionOceania                       -7.510754e-01 -2.489412e-01
## RegionRest of Europe                -1.527645e+00 -5.132385e-01
## RegionSouth America                 -1.338828e-01 -2.134472e-01
## Measles                             -1.621042e-02 -1.342859e-02
## BMI                                 -4.489189e-01 -1.951587e-01
## Incidents_HIV                        4.158642e-01  1.375514e-01
## GDP_per_capita                      -4.072031e-05 -1.074347e-05
## Thinness_five_nine_years            -1.682990e-02  4.135954e-02
## Schooling                           -2.676139e-01 -1.158447e-01
## Economy_status_Developed1           -6.274822e-01 -3.210898e-01
## Economy_status_Developing1           6.269228e-01  3.210882e-01
ridge_fit <- glmnet(x_trval, y_trval,
                    family = "binomial",
                    alpha = 0,
                    lambda = grid)

plot(ridge_fit, xvar = "lambda")
title("Ridge coefficient paths", line = 2.5)

Comment: Using cross-validated Ridge regression on the combined training/validation set, the model identified λ_min = 0.0153 as the tuning parameter that minimizes the binomial deviance, with a corresponding deviance of approximately 0.48. The more conservative λ_1se = 0.2559 produced a slightly higher deviance (~0.62) but yields a simpler and more regularized model.

The cross-validation plot showed a clear decrease in deviance as penalization weakens (moving toward smaller λ values), after which the curve flattens, indicating diminishing returns in model improvement.The coefficient-path plot demonstrated the characteristic Ridge pattern where all coefficients smoothly shrunk toward zero as λ increases, without any coefficients dropping exactly to zero.

Overall, the Ridge model suggested that variables such as GDP per capita, schooling, regional indicators, and HIV incidence contributed meaningfully to predicting whether infant deaths exceed 30, though effect sizes remained heavily regularized due to the penalty.

Ridge: coefficient table

# Create a logistic baseline model
logit_std <- glm(InfantDeaths_over30 ~ ., data = dftrval, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
logit_sum <- summary(logit_std)$coefficients

# Extract coefficients at optimal lambda
ridge_coef <- coef(ridge_cv, s = "lambda.min")
ridge_coef
## 17 x 1 sparse Matrix of class "dgCMatrix"
##                                        lambda.min
## (Intercept)                          1.355210e+01
## RegionAsia                           2.878540e-01
## RegionCentral America and Caribbean -2.423031e+00
## RegionEuropean Union                -5.856813e-01
## RegionMiddle East                   -1.588611e+00
## RegionNorth America                 -1.426218e-01
## RegionOceania                       -7.510754e-01
## RegionRest of Europe                -1.527645e+00
## RegionSouth America                 -1.338828e-01
## Measles                             -1.621042e-02
## BMI                                 -4.489189e-01
## Incidents_HIV                        4.158642e-01
## GDP_per_capita                      -4.072031e-05
## Thinness_five_nine_years            -1.682990e-02
## Schooling                           -2.676139e-01
## Economy_status_Developed1           -6.274822e-01
## Economy_status_Developing1           6.269228e-01
# Convert sparse matrix to vector
ridge_coef_vec <- as.numeric(ridge_coef)
names(ridge_coef_vec) <- rownames(ridge_coef)

# Combine both into a single data frame
coef_table_ridge <- data.frame(
  term = rownames(logit_sum),
  Estimate_logit = logit_sum[, "Estimate"],
  StdError_logit = logit_sum[, "Std. Error"],
  zvalue_logit   = logit_sum[, "z value"],
  pvalue_logit   = logit_sum[, "Pr(>|z|)"],
  Ridge_coef     = ridge_coef_vec[rownames(logit_sum)]
)

coef_table_ridge
##                                                                    term
## (Intercept)                                                 (Intercept)
## RegionAsia                                                   RegionAsia
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## RegionEuropean Union                               RegionEuropean Union
## RegionMiddle East                                     RegionMiddle East
## RegionNorth America                                 RegionNorth America
## RegionOceania                                             RegionOceania
## RegionRest of Europe                               RegionRest of Europe
## RegionSouth America                                 RegionSouth America
## Measles                                                         Measles
## BMI                                                                 BMI
## Incidents_HIV                                             Incidents_HIV
## GDP_per_capita                                           GDP_per_capita
## Thinness_five_nine_years                       Thinness_five_nine_years
## Schooling                                                     Schooling
## Economy_status_Developed1                     Economy_status_Developed1
##                                     Estimate_logit StdError_logit  zvalue_logit
## (Intercept)                            2.882520750   8.627514e+00  0.3341079200
## RegionAsia                             1.739465362   1.703934e+00  1.0208526882
## RegionCentral America and Caribbean  -20.736333530   6.104900e+03 -0.0033966706
## RegionEuropean Union                 -29.650613002   9.062427e+03 -0.0032718182
## RegionMiddle East                    -15.680368591   5.421936e+03 -0.0028920237
## RegionNorth America                   11.284343170   1.986258e+04  0.0005681207
## RegionOceania                         -0.667779777   1.354853e+00 -0.4928798237
## RegionRest of Europe                 -18.314231089   7.532661e+03 -0.0024313097
## RegionSouth America                    2.828459661   1.842900e+00  1.5347877436
## Measles                                0.011526007   3.211069e-02  0.3589460730
## BMI                                    0.026456018   3.555306e-01  0.0744127722
## Incidents_HIV                          3.917870992   1.578883e+00  2.4814201324
## GDP_per_capita                        -0.001254931   5.428571e-04 -2.3117157239
## Thinness_five_nine_years               0.001535216   1.135536e-01  0.0135197447
## Schooling                             -0.390462028   3.288820e-01 -1.1872403508
## Economy_status_Developed1             19.726862727   8.485774e+03  0.0023246981
##                                     pvalue_logit    Ridge_coef
## (Intercept)                           0.73829812  1.355210e+01
## RegionAsia                            0.30732424  2.878540e-01
## RegionCentral America and Caribbean   0.99728985 -2.423031e+00
## RegionEuropean Union                  0.99738947 -5.856813e-01
## RegionMiddle East                     0.99769250 -1.588611e+00
## RegionNorth America                   0.99954671 -1.426218e-01
## RegionOceania                         0.62209750 -7.510754e-01
## RegionRest of Europe                  0.99806010 -1.527645e+00
## RegionSouth America                   0.12483597 -1.338828e-01
## Measles                               0.71963543 -1.621042e-02
## BMI                                   0.94068195 -4.489189e-01
## Incidents_HIV                         0.01308600  4.158642e-01
## GDP_per_capita                        0.02079335 -4.072031e-05
## Thinness_five_nine_years              0.98921313 -1.682990e-02
## Schooling                             0.23513283 -2.676139e-01
## Economy_status_Developed1             0.99814516 -6.274822e-01
coef_table_ridge %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(abs(Ridge_coef)))
##                                                                    term
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## RegionMiddle East                                     RegionMiddle East
## RegionRest of Europe                               RegionRest of Europe
## RegionOceania                                             RegionOceania
## Economy_status_Developed1                     Economy_status_Developed1
## RegionEuropean Union                               RegionEuropean Union
## BMI                                                                 BMI
## Incidents_HIV                                             Incidents_HIV
## RegionAsia                                                   RegionAsia
## Schooling                                                     Schooling
## RegionNorth America                                 RegionNorth America
## RegionSouth America                                 RegionSouth America
## Thinness_five_nine_years                       Thinness_five_nine_years
## Measles                                                         Measles
## GDP_per_capita                                           GDP_per_capita
##                                     Estimate_logit StdError_logit  zvalue_logit
## RegionCentral America and Caribbean  -20.736333530   6.104900e+03 -0.0033966706
## RegionMiddle East                    -15.680368591   5.421936e+03 -0.0028920237
## RegionRest of Europe                 -18.314231089   7.532661e+03 -0.0024313097
## RegionOceania                         -0.667779777   1.354853e+00 -0.4928798237
## Economy_status_Developed1             19.726862727   8.485774e+03  0.0023246981
## RegionEuropean Union                 -29.650613002   9.062427e+03 -0.0032718182
## BMI                                    0.026456018   3.555306e-01  0.0744127722
## Incidents_HIV                          3.917870992   1.578883e+00  2.4814201324
## RegionAsia                             1.739465362   1.703934e+00  1.0208526882
## Schooling                             -0.390462028   3.288820e-01 -1.1872403508
## RegionNorth America                   11.284343170   1.986258e+04  0.0005681207
## RegionSouth America                    2.828459661   1.842900e+00  1.5347877436
## Thinness_five_nine_years               0.001535216   1.135536e-01  0.0135197447
## Measles                                0.011526007   3.211069e-02  0.3589460730
## GDP_per_capita                        -0.001254931   5.428571e-04 -2.3117157239
##                                     pvalue_logit    Ridge_coef
## RegionCentral America and Caribbean   0.99728985 -2.423031e+00
## RegionMiddle East                     0.99769250 -1.588611e+00
## RegionRest of Europe                  0.99806010 -1.527645e+00
## RegionOceania                         0.62209750 -7.510754e-01
## Economy_status_Developed1             0.99814516 -6.274822e-01
## RegionEuropean Union                  0.99738947 -5.856813e-01
## BMI                                   0.94068195 -4.489189e-01
## Incidents_HIV                         0.01308600  4.158642e-01
## RegionAsia                            0.30732424  2.878540e-01
## Schooling                             0.23513283 -2.676139e-01
## RegionNorth America                   0.99954671 -1.426218e-01
## RegionSouth America                   0.12483597 -1.338828e-01
## Thinness_five_nine_years              0.98921313 -1.682990e-02
## Measles                               0.71963543 -1.621042e-02
## GDP_per_capita                        0.02079335 -4.072031e-05

Comment: Using the logistic regression coefficients as a reference, I printed a coefficient table that also includes the Ridge regression coefficients evaluated at the optimal λ (lambda.min = 0.0153).

When ranking predictors by the absolute value of the Ridge coefficients, the largest coefficients were observed for regional indicators, particularly Central America and Caribbean, Middle East, Rest of Europe, and Oceania. This could mean that geographic factor plays a major role in distinguishing countries with >=30 infant deaths. Economic development status (Developed vs. Developing), BMI, and HIV incidence also showed relatively large Ridge coefficients. However, GDP per capita, Measles, and Thinness (5-9 years) had coefficients very close to 0 and appeared less important after regularization.

LASSO (Lambda and graphics)

set.seed(123)

# Lambda grid
grid <- c(10^seq(4, -2, length = 99), 0)
# Print
grid
##   [1] 1.000000e+04 8.685114e+03 7.543120e+03 6.551286e+03 5.689866e+03
##   [6] 4.941713e+03 4.291934e+03 3.727594e+03 3.237458e+03 2.811769e+03
##  [11] 2.442053e+03 2.120951e+03 1.842070e+03 1.599859e+03 1.389495e+03
##  [16] 1.206793e+03 1.048113e+03 9.102982e+02 7.906043e+02 6.866488e+02
##  [21] 5.963623e+02 5.179475e+02 4.498433e+02 3.906940e+02 3.393222e+02
##  [26] 2.947052e+02 2.559548e+02 2.222996e+02 1.930698e+02 1.676833e+02
##  [31] 1.456348e+02 1.264855e+02 1.098541e+02 9.540955e+01 8.286428e+01
##  [36] 7.196857e+01 6.250552e+01 5.428675e+01 4.714866e+01 4.094915e+01
##  [41] 3.556480e+01 3.088844e+01 2.682696e+01 2.329952e+01 2.023590e+01
##  [46] 1.757511e+01 1.526418e+01 1.325711e+01 1.151395e+01 1.000000e+01
##  [51] 8.685114e+00 7.543120e+00 6.551286e+00 5.689866e+00 4.941713e+00
##  [56] 4.291934e+00 3.727594e+00 3.237458e+00 2.811769e+00 2.442053e+00
##  [61] 2.120951e+00 1.842070e+00 1.599859e+00 1.389495e+00 1.206793e+00
##  [66] 1.048113e+00 9.102982e-01 7.906043e-01 6.866488e-01 5.963623e-01
##  [71] 5.179475e-01 4.498433e-01 3.906940e-01 3.393222e-01 2.947052e-01
##  [76] 2.559548e-01 2.222996e-01 1.930698e-01 1.676833e-01 1.456348e-01
##  [81] 1.264855e-01 1.098541e-01 9.540955e-02 8.286428e-02 7.196857e-02
##  [86] 6.250552e-02 5.428675e-02 4.714866e-02 4.094915e-02 3.556480e-02
##  [91] 3.088844e-02 2.682696e-02 2.329952e-02 2.023590e-02 1.757511e-02
##  [96] 1.526418e-02 1.325711e-02 1.151395e-02 1.000000e-02 0.000000e+00
# Cross-validation to find optimal lambda
lasso_cv <- cv.glmnet(x = x_trval, y = y_trval, family = "binomial", lambda = grid, alpha = 1)
names(lasso_cv)
##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "call"       "name"       "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
dim(coef(lasso_cv))
## [1] 17  1
# Plots
plot(lasso_cv)
title("LASSO CV - binomial", line = 2.5)

# Two optimal lambdas
lasso_lambda_min <- lasso_cv$lambda.min
lasso_lambda_min
## [1] 0.01
-log(lasso_lambda_min)
## [1] 4.60517
lasso_lambda_1se <- lasso_cv$lambda.1se
lasso_lambda_1se
## [1] 0.09540955
-log(lasso_lambda_1se)
## [1] 2.349577
lasso_cv$cvm
##   [1] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##   [8] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [15] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [22] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [29] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [36] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [43] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [50] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [57] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [64] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
##  [71] 1.3338872 1.3338872 1.3338872 1.3070293 1.1754549 1.0597293 0.9682990
##  [78] 0.8948038 0.8350894 0.7866688 0.7407394 0.6946588 0.6579043 0.6285050
##  [85] 0.6050606 0.5845062 0.5666029 0.5518545 0.5391986 0.5282819 0.5194396
##  [92] 0.5129176 0.5078816 0.5042810 0.5011912 0.4986654 0.4968462 0.4946516
##  [99] 0.4935439 0.9606149
lasso_cv$index
##     Lambda
## min     99
## 1se     83
lasso_cv$cvm[lasso_cv$index]
## [1] 0.4935439 0.6579043
coef(lasso_cv, c(lasso_lambda_min, lasso_lambda_1se))
## 17 x 2 sparse Matrix of class "dgCMatrix"
##                                      s=0.01000000 s=0.09540955
## (Intercept)                          1.503035e+01  10.98572793
## RegionAsia                           2.260266e-01   .         
## RegionCentral America and Caribbean -2.323942e+00   .         
## RegionEuropean Union                 .              .         
## RegionMiddle East                   -4.813233e-01   .         
## RegionNorth America                  .              .         
## RegionOceania                       -1.710656e-01   .         
## RegionRest of Europe                -3.805341e-01   .         
## RegionSouth America                  1.646567e-01   .         
## Measles                             -7.084773e-03   .         
## BMI                                 -5.074938e-01  -0.40636130
## Incidents_HIV                        5.280790e-01   0.08492836
## GDP_per_capita                      -7.165106e-05   .         
## Thinness_five_nine_years             .              .         
## Schooling                           -3.139034e-01  -0.20142368
## Economy_status_Developed1            .              .         
## Economy_status_Developing1           .              .
lasso_fit <- glmnet(x_trval, y_trval,
                    family = "binomial",
                    alpha = 1,
                    lambda = grid)

plot(lasso_fit, xvar = "lambda")
title("Lasso coefficient paths", line = 2.5)

Comment: Using LASSO with cross validation, the optimal tuning parameter was lambda_min = 0.01, with a more parsimonious choice of lambda_1se = 0.0954. At lambda_min, several predictors have nonzero coefficients, while at lambda_1se only BMI, HIV incidence, and years of schooling remain in the model, indicating that these variables are the most important predictors of having 30 or more infant deaths. The minimum cross validated binomial deviance was about 0.49, and the 1SE had a slightly higher deviance of about 0.66.

LASSO Important Variables

# logistic baseline model summary
logit_sum <- summary(logit_std)$coefficients

# LASSO coefficients at optimal lambda (lambda.min)
lasso_coef <- coef(lasso_cv, s = lasso_lambda_min)
lasso_coef_vec <- as.numeric(lasso_coef)
names(lasso_coef_vec) <- rownames(lasso_coef)

# logistic + LASSO coefficient table
coef_table_lasso <- data.frame(
  term          = rownames(logit_sum),
  Estimate_logit = logit_sum[, "Estimate"],
  StdError_logit = logit_sum[, "Std. Error"],
  zvalue_logit   = logit_sum[, "z value"],
  pvalue_logit   = logit_sum[, "Pr(>|z|)"],
  LASSO_coef     = lasso_coef_vec[rownames(logit_sum)]
)

coef_table_lasso
##                                                                    term
## (Intercept)                                                 (Intercept)
## RegionAsia                                                   RegionAsia
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## RegionEuropean Union                               RegionEuropean Union
## RegionMiddle East                                     RegionMiddle East
## RegionNorth America                                 RegionNorth America
## RegionOceania                                             RegionOceania
## RegionRest of Europe                               RegionRest of Europe
## RegionSouth America                                 RegionSouth America
## Measles                                                         Measles
## BMI                                                                 BMI
## Incidents_HIV                                             Incidents_HIV
## GDP_per_capita                                           GDP_per_capita
## Thinness_five_nine_years                       Thinness_five_nine_years
## Schooling                                                     Schooling
## Economy_status_Developed1                     Economy_status_Developed1
##                                     Estimate_logit StdError_logit  zvalue_logit
## (Intercept)                            2.882520750   8.627514e+00  0.3341079200
## RegionAsia                             1.739465362   1.703934e+00  1.0208526882
## RegionCentral America and Caribbean  -20.736333530   6.104900e+03 -0.0033966706
## RegionEuropean Union                 -29.650613002   9.062427e+03 -0.0032718182
## RegionMiddle East                    -15.680368591   5.421936e+03 -0.0028920237
## RegionNorth America                   11.284343170   1.986258e+04  0.0005681207
## RegionOceania                         -0.667779777   1.354853e+00 -0.4928798237
## RegionRest of Europe                 -18.314231089   7.532661e+03 -0.0024313097
## RegionSouth America                    2.828459661   1.842900e+00  1.5347877436
## Measles                                0.011526007   3.211069e-02  0.3589460730
## BMI                                    0.026456018   3.555306e-01  0.0744127722
## Incidents_HIV                          3.917870992   1.578883e+00  2.4814201324
## GDP_per_capita                        -0.001254931   5.428571e-04 -2.3117157239
## Thinness_five_nine_years               0.001535216   1.135536e-01  0.0135197447
## Schooling                             -0.390462028   3.288820e-01 -1.1872403508
## Economy_status_Developed1             19.726862727   8.485774e+03  0.0023246981
##                                     pvalue_logit    LASSO_coef
## (Intercept)                           0.73829812  1.503035e+01
## RegionAsia                            0.30732424  2.260266e-01
## RegionCentral America and Caribbean   0.99728985 -2.323942e+00
## RegionEuropean Union                  0.99738947  0.000000e+00
## RegionMiddle East                     0.99769250 -4.813233e-01
## RegionNorth America                   0.99954671  0.000000e+00
## RegionOceania                         0.62209750 -1.710656e-01
## RegionRest of Europe                  0.99806010 -3.805341e-01
## RegionSouth America                   0.12483597  1.646567e-01
## Measles                               0.71963543 -7.084773e-03
## BMI                                   0.94068195 -5.074938e-01
## Incidents_HIV                         0.01308600  5.280790e-01
## GDP_per_capita                        0.02079335 -7.165106e-05
## Thinness_five_nine_years              0.98921313  0.000000e+00
## Schooling                             0.23513283 -3.139034e-01
## Economy_status_Developed1             0.99814516  0.000000e+00
# Important coefficients
coef_table_lasso %>%
  dplyr::filter(term != "(Intercept)", LASSO_coef != 0) %>%
  dplyr::arrange(dplyr::desc(abs(LASSO_coef)))
##                                                                    term
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## Incidents_HIV                                             Incidents_HIV
## BMI                                                                 BMI
## RegionMiddle East                                     RegionMiddle East
## RegionRest of Europe                               RegionRest of Europe
## Schooling                                                     Schooling
## RegionAsia                                                   RegionAsia
## RegionOceania                                             RegionOceania
## RegionSouth America                                 RegionSouth America
## Measles                                                         Measles
## GDP_per_capita                                           GDP_per_capita
##                                     Estimate_logit StdError_logit zvalue_logit
## RegionCentral America and Caribbean  -20.736333530   6.104900e+03 -0.003396671
## Incidents_HIV                          3.917870992   1.578883e+00  2.481420132
## BMI                                    0.026456018   3.555306e-01  0.074412772
## RegionMiddle East                    -15.680368591   5.421936e+03 -0.002892024
## RegionRest of Europe                 -18.314231089   7.532661e+03 -0.002431310
## Schooling                             -0.390462028   3.288820e-01 -1.187240351
## RegionAsia                             1.739465362   1.703934e+00  1.020852688
## RegionOceania                         -0.667779777   1.354853e+00 -0.492879824
## RegionSouth America                    2.828459661   1.842900e+00  1.534787744
## Measles                                0.011526007   3.211069e-02  0.358946073
## GDP_per_capita                        -0.001254931   5.428571e-04 -2.311715724
##                                     pvalue_logit    LASSO_coef
## RegionCentral America and Caribbean   0.99728985 -2.323942e+00
## Incidents_HIV                         0.01308600  5.280790e-01
## BMI                                   0.94068195 -5.074938e-01
## RegionMiddle East                     0.99769250 -4.813233e-01
## RegionRest of Europe                  0.99806010 -3.805341e-01
## Schooling                             0.23513283 -3.139034e-01
## RegionAsia                            0.30732424  2.260266e-01
## RegionOceania                         0.62209750 -1.710656e-01
## RegionSouth America                   0.12483597  1.646567e-01
## Measles                               0.71963543 -7.084773e-03
## GDP_per_capita                        0.02079335 -7.165106e-05

Comment: Using the optimal lambda (λmin = 0.01), the most important variables, based on the magnitude of the LASSO coefficients, were Region (Central America and Caribbean, Middle East, Rest of Europe, Asia, Oceania, South America), Incidents_HIV, BMI, and Schooling. All other predictors were shrunk to zero, indicating limited additional predictive value for infant deaths >=30.

Final Model

### Final model: LASSO with lambda.1se

# Final lambda selection (lasso, lambda.1se)
lasso_lambda_1se <- lasso_cv$lambda.1se

# test set model matrix and outcome
x_test <- model.matrix(InfantDeaths_over30 ~ ., data = dftest)[, -1]
y_test <- dftest$InfantDeaths_over30

# testset probability
test_prob <- as.numeric(
  predict(lasso_cv,
          s    = lasso_lambda_1se,
          newx = x_test,
          type = "response")
)

# test MSE
test_mse <- mean( (test_prob - y_test)^2 )
test_mse
## [1] 0.09311305
# 0.5 cut-off
test_pred_class <- ifelse(test_prob > 0.5, 1, 0)

# test error
test_error <- mean(test_pred_class != y_test)
test_error
## [1] 0.06818182
plot_df <- data.frame(
  InfantDeaths_over30 = factor(y_test, levels = c(0, 1)),
  prob = test_prob
)

ggplot(plot_df, aes(x = InfantDeaths_over30, y = prob)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.4) +
  labs(
    x = "Observed InfantDeaths_over30",
    y = "Predicted probability",
    title = "LASSO (lambda.1se) predicted probabilities on the test set"
  ) +
  theme_minimal()

Comment: I selected the LASSO model with λ.1se as the final model because it provided a more parsimonious model while maintaining similar predictive performance to λ.min. The λ.1se model also reduced the risk of overfitting, which is important given the relatively small sample size in the training and validation sets. The final LASSO model achieved a test MSE of 0.093 and a misclassification error rate of 6.8%, suggesting good predictive accuracy. The predicted probabilities showed clear separation between the two outcome groups. Countries with fewer than 30 infant deaths predicted probabilities mostly below 0.3, whereas those with 30 or more deaths clustered around 0.6–0.8. This indicates that the model captures meaningful differences between groups and fits the data well.

Interpretation

Explanation for a practitioner

Based on the LASSO model using the λ.1se penalty, several variables emerge as meaningful predictors of whether a country experiences 30 or more infant deaths. Because LASSO shrinks unimportant predictors exactly to zero, the remaining non-zero coefficients represent variables that contribute most strongly to distinguishing high-mortality countries from low-mortality countries.

Countries in regions such as Central America and the Caribbean, the Middle East, and Oceania show substantially different infant-mortality risks compared with the reference region, even after accounting for other variables. These region effects likely reflect broader structural factors such as health-system capacity, access to prenatal care, or public-health infrastructure.

Among continuous predictors, higher HIV incidence and lower BMI and schooling levels were associated with higher odds of having 30 or more infant deaths. These associations are consistent with known public-health patterns: HIV burden can strain health systems and contribute to poorer maternal and neonatal outcomes, while low schooling levels often reflect underlying socioeconomic disadvantage and reduced access to preventive services. Lower average BMI may also indicate nutritional challenges or food insecurity, which can indirectly affect infant health.

Importantly, LASSO focuses on predictive importance rather than statistical significance. Thus, the variables retained by the model should be interpreted as the most informative for distinguishing high-risk countries, not necessarily as causal factors. Nonetheless, the results highlight key structural and health-system indicators that tend to differentiate countries with high infant mortality from those with lower rates. For practitioners, these findings can help prioritize areas for intervention, such as strengthening HIV prevention and treatment programs, improving maternal education, and addressing regional health-system disparities.

Splines

Training Validation Test Sets

# training+validation = 75% vs. testing = 25%
set.seed(123)
trainval <- createDataPartition(y=df$Life_expectancy, p=0.75, list=FALSE)

dftrval <- df[trainval,]
dftest <- df[-trainval,]

# training = 50% vs. Validation = 25%
set.seed(123)
train <- createDataPartition(y=dftrval$Life_expectancy, p=2/3, list=FALSE)
dftr <- dftrval[train,]
dfval <- dftrval[-train,]

# Verify the data
dim(df); dim(dftrval); dim(dftest); dim(dftr); dim(dfval)
## [1] 2864   14
## [1] 2150   14
## [1] 714  14
## [1] 1435   14
## [1] 715  14

EDA

# Remove Country
dftrval <- dftrval %>%
  select(-Country)

dftest <- dftest %>%
  select(-Country)

dftr <- dftr %>%
  select(-Country)

dfval <- dfval %>%
  select(-Country)

# Check structure
str(dftrval)
## 'data.frame':    2150 obs. of  13 variables:
##  $ Region                   : Factor w/ 9 levels "Africa","Asia",..: 4 2 9 5 3 8 4 5 8 9 ...
##  $ Year                     : int  2015 2007 2006 2012 2006 2015 2000 2001 2008 2012 ...
##  $ Infant_deaths            : num  2.7 51.5 32.8 3.4 9.8 6.6 8.7 22 15.3 15.4 ...
##  $ Measles                  : int  94 35 74 89 86 97 99 87 92 70 ...
##  $ BMI                      : num  26 21.2 25.3 27 26.4 26.2 25.9 27.9 26.5 26.1 ...
##  $ Incidents_HIV            : num  0.09 0.13 0.79 0.08 0.16 0.08 0.08 0.13 0.43 0.24 ...
##  $ GDP_per_capita           : int  25742 1076 4146 33995 9110 9313 8971 3708 2235 9057 ...
##  $ Thinness_five_nine_years : num  0.5 28 5.5 1.1 1.9 2.3 2.3 3.9 3.1 2.8 ...
##  $ Schooling                : num  9.7 5 7.9 12.8 7.9 12 10.2 9.6 10.9 7.3 ...
##  $ Economy_status_Developed : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
##  $ Economy_status_Developing: Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 2 2 2 ...
##  $ Life_expectancy          : num  82.8 65.4 67 81.7 78.2 71.2 71.2 71.9 68.7 74.2 ...
##  $ InfantDeaths_over30      : num  0 1 1 0 0 0 0 0 0 0 ...

Continuous variables

library(dplyr)
library(splines)
library(boot)

# 1) smoothing spline + MSE
# 2) natural spline (# of knots as CV) + MSE
# 3) train data scatter plot + two lines

fit_plot_splines <- function(xvar, df_train, df_val) {
  # x, y
  x_train <- df_train[[xvar]]
  y_train <- df_train$Life_expectancy
  x_val   <- df_val[[xvar]]
  y_val   <- df_val$Life_expectancy
  
  ## 1. smoothing spline (df: cv=TRUE)
  ss_fit <- smooth.spline(x = x_train, y = y_train, cv = TRUE)
  ss_df  <- ss_fit$df
  
  # MSE
  pred_ss_val <- predict(ss_fit, x = x_val)$y
  MSE_ss <- mean((y_val - pred_ss_val)^2)
  
  ## 2. natural spline (# of knots as CV)
  cand_knots <- 3:8                     
  cv_mse <- rep(NA, length(cand_knots))
  
  df_train_tmp <- df_train %>%
    mutate(x_tmp = .data[[xvar]])
  
  for (i in seq_along(cand_knots)) {
    k <- cand_knots[i]
    # equally spaced quantiles: set interior knots
    probs   <- seq(0, 1, length.out = k + 2)
    knots_i <- quantile(x_train, probs = probs)[-c(1, k + 2)]
    
    fit_tmp <- glm(Life_expectancy ~ ns(x_tmp, knots = knots_i),
                   data = df_train_tmp)
    
    cv_res <- cv.glm(df_train_tmp, fit_tmp, K = 10)
    cv_mse[i] <- cv_res$delta[1]
  }
  
  best_k <- cand_knots[which.min(cv_mse)]
  probs_best   <- seq(0, 1, length.out = best_k + 2)
  best_knots   <- quantile(x_train, probs = probs_best)[-c(1, best_k + 2)]
  
  ns_fit <- lm(Life_expectancy ~ ns(x_tmp, knots = best_knots),
               data = df_train_tmp)
  
  df_val_tmp <- df_val %>%
    mutate(x_tmp = .data[[xvar]])
  pred_ns_val <- predict(ns_fit, newdata = df_val_tmp)
  MSE_ns <- mean((y_val - pred_ns_val)^2)
  
  ## 3. Plots (training data)
  x_grid <- seq(min(x_train), max(x_train), length.out = 100)
  ss_grid <- predict(ss_fit, x_grid)$y
  
  grid_df <- data.frame(x_tmp = x_grid)
  ns_grid <- predict(ns_fit, newdata = grid_df)
  
  plot(x_train, y_train,
       xlab = xvar,
       ylab = "Life_expectancy",
       main = paste("Life_expectancy vs", xvar))
  lines(x_grid, ss_grid, col = "blue", lwd = 2)
  lines(x_grid, ns_grid, col = "red",  lwd = 2)
  legend("topleft",
         legend = c(
           paste0("Smoothing spline (df = ", round(ss_df, 1), ")"),
           paste0("Natural spline (knots = ", best_k, ")")
         ),
         col = c("blue", "red"),
         lty = 1, lwd = 2, bty = "n")
  
  # Values for the table
  out <- list(
    var       = xvar,
    df_ss     = ss_df,
    MSE_ss    = MSE_ss,
    knots_ns  = best_k,
    MSE_ns    = MSE_ns
  )
  return(out)
}

Year

res_year <- fit_plot_splines("Year", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_year
## $var
## [1] "Year"
## 
## $df_ss
## [1] 2.000005
## 
## $MSE_ss
## [1] 84.61247
## 
## $knots_ns
## [1] 7
## 
## $MSE_ns
## [1] 85.19888

Infant Deaths

res_infant <- fit_plot_splines("Infant_deaths", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_infant
## $var
## [1] "Infant_deaths"
## 
## $df_ss
## [1] 22.69609
## 
## $MSE_ss
## [1] 10.5828
## 
## $knots_ns
## [1] 5
## 
## $MSE_ns
## [1] 10.54443

GDP per capita

res_gdp <- fit_plot_splines("GDP_per_capita", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_gdp
## $var
## [1] "GDP_per_capita"
## 
## $df_ss
## [1] 27.06004
## 
## $MSE_ss
## [1] 29.62522
## 
## $knots_ns
## [1] 4
## 
## $MSE_ns
## [1] 29.45887

Measles

res_measles <- fit_plot_splines("Measles", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_measles
## $var
## [1] "Measles"
## 
## $df_ss
## [1] 21.67301
## 
## $MSE_ss
## [1] 43.54895
## 
## $knots_ns
## [1] 8
## 
## $MSE_ns
## [1] 49.43904

BMI

res_bmi <- fit_plot_splines("BMI", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_bmi
## $var
## [1] "BMI"
## 
## $df_ss
## [1] 29.01203
## 
## $MSE_ss
## [1] 42.95983
## 
## $knots_ns
## [1] 8
## 
## $MSE_ns
## [1] 43.90295

HIV/AIDS

res_hiv <- fit_plot_splines("Incidents_HIV", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_hiv
## $var
## [1] "Incidents_HIV"
## 
## $df_ss
## [1] 9.369317
## 
## $MSE_ss
## [1] 40.8478
## 
## $knots_ns
## [1] 8
## 
## $MSE_ns
## [1] 38.76133

Schooling

res_school <- fit_plot_splines("Schooling", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_school
## $var
## [1] "Schooling"
## 
## $df_ss
## [1] 40.45272
## 
## $MSE_ss
## [1] 37.6862
## 
## $knots_ns
## [1] 8
## 
## $MSE_ns
## [1] 37.33973

Thinness 5-9 years

res_thin <- fit_plot_splines("Thinness_five_nine_years", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful

res_thin
## $var
## [1] "Thinness_five_nine_years"
## 
## $df_ss
## [1] 34.64684
## 
## $MSE_ss
## [1] 46.68214
## 
## $knots_ns
## [1] 8
## 
## $MSE_ns
## [1] 50.23682

Null (Intercept) Model

# Null model: intercept only
null_fit <- lm(Life_expectancy ~ 1, data = dftr)

# MSE
null_pred_val <- predict(null_fit, newdata = dfval)
MSE_null <- mean((dfval$Life_expectancy - null_pred_val)^2)

MSE_null
## [1] 87.29366
Variable Name MSE from Smoothing Spline S moothing Spline (SS) DF MSE from ns() # knots for ns() Better parameterization?
Intercept only MSE intercept only NULL MSE=87.3
Year 84.6 df= 2.0 85.3 ns() # knot= 5 SS
Infant_ deaths 10.6 df= 22.7 10.5 ns() # knot= 8 NS
GDP_per_ capita 29.6 df= 27.1 29.5 ns() # knot= 4 NS
Measles 43.5 df= 21.7 49.4 ns() # knot= 8 SS
BMI 43.0 df= 29.0 43.9 ns() # knot= 8 SS
Incidents_ HIV 40.8 df= 9.4 38.8 ns() # knot= 8 NS
Scholling 37.7 df= 40.5 37.3 ns() # knot= 8 NS
Thinness_ five_nine_ years 46.7 df= 34.6 50.2 ns() # knot= 8 SS

Categorical variables

Variable Name MSE
Region 32.76
Econ_Developed 64.43
Econ_Developing 64.43
cat_vars <- names(dftr)[sapply(dftr, function(x) is.factor(x) | is.character(x))]
cat_vars <- setdiff(cat_vars, "Life_expectancy")

cat_vars
## [1] "Region"                    "Economy_status_Developed" 
## [3] "Economy_status_Developing"
# Boxplots
for (v in cat_vars) {
  boxplot(dftr$Life_expectancy ~ dftr[[v]],
          main = paste("Life_expectancy by", v),
          xlab = v,
          ylab = "Life_expectancy")
}

# regression analysis + validation MSE
mse_cat <- data.frame(
  Variable = cat_vars,
  MSE      = NA_real_,
  stringsAsFactors = FALSE
)

for (i in seq_along(cat_vars)) {
  v <- cat_vars[i]
  
  # Life_expectancy ~ categorical variable
  fml <- as.formula(paste("Life_expectancy ~", v))
  
  fit <- lm(fml, data = dftr)
  
  pred_val <- predict(fit, newdata = dfval)
  
  mse_cat$MSE[i] <- mean((dfval$Life_expectancy - pred_val)^2)
}

mse_cat
##                    Variable      MSE
## 1                    Region 32.75974
## 2  Economy_status_Developed 64.43176
## 3 Economy_status_Developing 64.43176

EDA Important Variables

Based on the univariate analyses, Infant_deaths showed the strongest relationship with Life_expectancy. This variable produced by far the lowest prediction errors, with MSE values of 10.54-10.58, indicating a very strong inverse association between infant death and life expectancy.

GDP_per_capita also showed relatively strong univariate relationship. It had an MSE of 29.46–29.63, making it the second strongest continuous predictor.

Region was the third strongest predictor with a validation MSE of 32.76.

Schooling (MSE 37.34–37.69) and Incidents_HIV (MSE 38.76–40.85) showed moderate predictive strength as well.

Overall, Infant_deaths is the strongest univariate predictor of Life_expectancy, followed by GDP_per_capita. Region, Schooling, Incidents_HIV, and Region also show meaningful univariate associations.

Full Model

library(gam)
library(splines)

# Full model using best spline choices for each predictor
full_model <- gam(
  Life_expectancy ~ 
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(Measles, df = 8) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

summary(full_model)
## 
## Call: gam(formula = Life_expectancy ~ s(Year, df = 2) + ns(Infant_deaths, 
##     df = 10) + ns(GDP_per_capita, df = 6) + s(Measles, df = 8) + 
##     s(BMI, df = 8) + ns(Incidents_HIV, df = 10) + ns(Schooling, 
##     df = 10) + s(Thinness_five_nine_years, df = 8) + Region, 
##     data = dftr)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7.88717 -1.02089  0.05979  1.14652  6.48500 
## 
## (Dispersion Parameter for gaussian family taken to be 3.4898)
## 
##     Null Deviance: 127300.2 on 1434 degrees of freedom
## Residual Deviance: 4760.14 on 1364 degrees of freedom
## AIC: 5937.08 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                                       Df Sum Sq Mean Sq   F value    Pr(>F)    
## s(Year, df = 2)                        1   3900  3899.7 1117.4443 < 2.2e-16 ***
## ns(Infant_deaths, df = 10)            10  97528  9752.8 2794.6174 < 2.2e-16 ***
## ns(GDP_per_capita, df = 6)             6   1008   168.0   48.1264 < 2.2e-16 ***
## s(Measles, df = 8)                     1      1     1.3    0.3857   0.53470    
## s(BMI, df = 8)                         1     59    58.9   16.8835 4.211e-05 ***
## ns(Incidents_HIV, df = 10)            10   7448   744.8  213.4231 < 2.2e-16 ***
## ns(Schooling, df = 10)                10    529    52.9   15.1445 < 2.2e-16 ***
## s(Thinness_five_nine_years, df = 8)    1     10     9.9    2.8422   0.09205 .  
## Region                                 8    636    79.5   22.7686 < 2.2e-16 ***
## Residuals                           1364   4760     3.5                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                                     Npar Df Npar F     Pr(F)    
## (Intercept)                                                     
## s(Year, df = 2)                           1 19.057 1.364e-05 ***
## ns(Infant_deaths, df = 10)                                      
## ns(GDP_per_capita, df = 6)                                      
## s(Measles, df = 8)                        7  7.073 2.571e-08 ***
## s(BMI, df = 8)                            7 10.603 4.816e-13 ***
## ns(Incidents_HIV, df = 10)                                      
## ns(Schooling, df = 10)                                          
## s(Thinness_five_nine_years, df = 8)       7 12.467 1.554e-15 ***
## Region                                                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# predictions on validation set
pred_val <- predict(full_model, newdata = dfval)

# validation MSE
MSE_full <- mean((dfval$Life_expectancy - pred_val)^2)

MSE_full
## [1] 3.847604

Comment: I fitted a GAM on the training set using the best-performing spline parameterizations from the univariate analyses (NS for Infant_deaths, GDP_per_capita, Incidents_HIV, and Schooling; SS for Year, Measles, BMI, and Thinness; and Region as a factor). The model significantly reduced deviance (from 127,300 to 4,760) and showed strong nonparametric effects for several predictors. The validation MSE was 3.85, indicating that the multivariable spline model provides substantial improvement in predictive accuracy over the univariate models.

Refine the model

Pseudo-backward stepwise selection

1. Check Year

# Full model validation MSE -> base MSE
best_MSE <- MSE_full   
best_MSE
## [1] 3.847604
# Model without Year
mod_noYear <- gam(
  Life_expectancy ~ 
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(Measles, df = 8) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noYear <- predict(mod_noYear, newdata = dfval)
MSE_noYear <- mean((dfval$Life_expectancy - pred_noYear)^2)
MSE_noYear
## [1] 3.933785

Comment: MSE changed from 3.847604 to 3.933785. Removing Year worsened prediction error, so Year was kept in the model.

Best MSE remains 3.847604.

2. Check Infant_deaths

# Best MSE carried over from previous step
best_MSE
## [1] 3.847604
# Model without Infant_deaths
mod_noInfant <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(GDP_per_capita, df = 6) +
    s(Measles, df = 8) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noInfant <- predict(mod_noInfant, newdata = dfval)
MSE_noInfant <- mean((dfval$Life_expectancy - pred_noInfant)^2)
MSE_noInfant
## [1] 9.168095

Comment: MSE changed from 3.847604 to 9.168095. Removing Infant_deaths worsened prediction error, so Infant_deaths was kept in the model.

Best MSE remains 3.847604.

3. Check GDP_per_capita

# Best MSE carried over from previous step
best_MSE
## [1] 3.847604
# Model without GDP_per_capita
mod_noGDP <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    s(Measles, df = 8) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noGDP <- predict(mod_noGDP, newdata = dfval)
MSE_noGDP <- mean((dfval$Life_expectancy - pred_noGDP)^2)
MSE_noGDP
## [1] 4.036478

Comment: MSE changed from 3.847604 to 4.036478. Removing GDP_per_capita worsened prediction error, so GDP_per_capita was kept in the model.

Best MSE remains 3.847604.

4. Check Measles

# Best MSE carried over from previous step
best_MSE
## [1] 3.847604
# Model without Measles
mod_noMeasles <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noMeasles <- predict(mod_noMeasles, newdata = dfval)
MSE_noMeasles <- mean((dfval$Life_expectancy - pred_noMeasles)^2)
MSE_noMeasles
## [1] 3.816505

Comment: MSE changed from 3.847604 to 3.816505. Removing Measles improved prediction error, so Measles was removed from the model.

Best MSE is now 3.816505.

5. Check BMI

# Best MSE updated
best_MSE <- MSE_noMeasles
best_MSE
## [1] 3.816505
# Model without BMI
mod_noBMI <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noBMI <- predict(mod_noBMI, newdata = dfval)
MSE_noBMI <- mean((dfval$Life_expectancy - pred_noBMI)^2)
MSE_noBMI
## [1] 3.934864

Comment: MSE changed from 3.816505 to 3.934864. Removing BMI worsened prediction error, so BMI was kept in the model.

Best MSE remains 3.816505.

6. Check Incidents_HIV

# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles   # 3.816505
best_MSE
## [1] 3.816505
# Model without Incidents_HIV
mod_noHIV <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noHIV <- predict(mod_noHIV, newdata = dfval)
MSE_noHIV <- mean((dfval$Life_expectancy - pred_noHIV)^2)
MSE_noHIV
## [1] 7.165852

Comment: MSE changed from 3.816505 to 7.165852. Removing Incidents_HIV worsened prediction error, so Incidents_HIV was kept in the model.

Best MSE remains 3.816505.

7. Check Schooling

# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles   # 3.816505
best_MSE
## [1] 3.816505
# Model without Schooling
mod_noSchool <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

# validation MSE
pred_noSchool <- predict(mod_noSchool, newdata = dfval)
MSE_noSchool <- mean((dfval$Life_expectancy - pred_noSchool)^2)
MSE_noSchool
## [1] 4.049667

Comment: MSE changed from 3.816505 to 4.049667. Removing Schooling worsened prediction error, so Schooling was kept in the model.

Best MSE remains 3.847604.

8. Check Thinness_five_nine_years

# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles   # 3.816505
best_MSE
## [1] 3.816505
# Model without Thinness_five_nine_years
mod_noThin <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    Region,
  data = dftr
)

# validation MSE
pred_noThin <- predict(mod_noThin, newdata = dfval)
MSE_noThin <- mean((dfval$Life_expectancy - pred_noThin)^2)
MSE_noThin
## [1] 3.969929

Comment: MSE changed from 3.816505 to 3.969929. Removing Thinness_five_nine_years worsened prediction error, so Thinness_five_nine_years was kept in the model.

Best MSE remains 3.816505.

9. Check Region

# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles   # 3.816505
best_MSE
## [1] 3.816505
# Model without Region
mod_noRegion <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8),
  data = dftr
)

# validation MSE
pred_noRegion <- predict(mod_noRegion, newdata = dfval)
MSE_noRegion <- mean((dfval$Life_expectancy - pred_noRegion)^2)
MSE_noRegion
## [1] 4.345767

Comment: MSE changed from 3.816505 to 4.345767. Removing Region worsened prediction error, so Region was kept in the model.

Best MSE remains 3.816505.

# Summarize MSE for each model
MSE_full
## [1] 3.847604
MSE_noYear
## [1] 3.933785
MSE_noInfant
## [1] 9.168095
MSE_noGDP
## [1] 4.036478
MSE_noMeasles
## [1] 3.816505
MSE_noBMI
## [1] 3.934864
MSE_noHIV
## [1] 7.165852
MSE_noSchool
## [1] 4.049667
MSE_noThin
## [1] 3.969929
MSE_noRegion
## [1] 4.345767

Model Building Table

Best MSE at Start Model (Remove Variable) Validation Set MSE Decision Best MSE at End
3.847604 Full Multivariable Model 3.847604 Base MSE 3.847604
3.847604 Remove Year 3.933785 MSE ↑ → KEEP 3.847604
3.847604 Remove Infant_deaths 9.168095 MSE ↑ → KEEP 3.847604
3.847604 Remove GDP_per_capita 4.036478 MSE ↑ → KEEP 3.847604
3.847604 Remove Measles 3.816505 MSE ↓ → REMOVE 3.816505
3.816505 Remove BMI 3.934864 MSE ↑ → KEEP 3.816505
3.816505 Remove Incidents_HIV 7.165852 MSE ↑ → KEEP 3.816505
3.816505 Remove Schooling 4.049667 MSE ↑ → KEEP 3.816505
3.816505 Remove Thinness_5_9_years 3.969929 MSE ↑ → KEEP 3.816505
3.816505 Remove Region 4.345767 MSE ↑ → KEEP 3.816505

Important Variables

Based on the univariate spline analyses, Infant_deaths showed the strongest relationship with Life_expectancy, producing the lowest validation MSE (approximately 10.54-10.58). This indicates a very strong inverse association, where higher infant deaths predict substantially lower life expectancy.

GDP_per_capita also demonstrated a strong univariate association, with validation MSE around 29.46–29.63, making it the second most predictive continuous variable.

Among the categorical variables, Region was the most informative predictor, with a validation MSE of 32.76, outperforming both developed- and developing-country indicators.

Other continuous predictors such as Schooling, Incidents_HIV, and BMI had moderate univariate relationships (MSE values between 37-43).

Overall, the variables most strongly related to Life_expectancy on a univariate basis are Infant_deaths, GDP_per_capita, and Region.

Final Model/Interpretation

# Final model includes all variables (Year, Infant_deaths, GDP_per_capita, BMI, Incidents_HIV, Schooling, Thinness_five_nine_years, and Region) except Measles.
final_model <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftr
)

summary(final_model)
## 
## Call: gam(formula = Life_expectancy ~ s(Year, df = 2) + ns(Infant_deaths, 
##     df = 10) + ns(GDP_per_capita, df = 6) + s(BMI, df = 8) + 
##     ns(Incidents_HIV, df = 10) + ns(Schooling, df = 10) + s(Thinness_five_nine_years, 
##     df = 8) + Region, data = dftr)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -8.27633 -0.99283  0.01517  1.19476  6.50034 
## 
## (Dispersion Parameter for gaussian family taken to be 3.5809)
## 
##     Null Deviance: 127300.2 on 1434 degrees of freedom
## Residual Deviance: 4913.04 on 1372 degrees of freedom
## AIC: 5966.448 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                                       Df Sum Sq Mean Sq   F value    Pr(>F)    
## s(Year, df = 2)                        1   3796  3795.5 1059.9233 < 2.2e-16 ***
## ns(Infant_deaths, df = 10)            10  99099  9909.9 2767.4130 < 2.2e-16 ***
## ns(GDP_per_capita, df = 6)             6   1092   182.1   50.8471 < 2.2e-16 ***
## s(BMI, df = 8)                         1     60    59.7   16.6838 4.671e-05 ***
## ns(Incidents_HIV, df = 10)            10   7303   730.3  203.9414 < 2.2e-16 ***
## ns(Schooling, df = 10)                10    563    56.3   15.7166 < 2.2e-16 ***
## s(Thinness_five_nine_years, df = 8)    1      2     1.7    0.4764    0.4902    
## Region                                 8    710    88.7   24.7756 < 2.2e-16 ***
## Residuals                           1372   4913     3.6                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                                     Npar Df Npar F     Pr(F)    
## (Intercept)                                                     
## s(Year, df = 2)                           1 20.895 5.287e-06 ***
## ns(Infant_deaths, df = 10)                                      
## ns(GDP_per_capita, df = 6)                                      
## s(BMI, df = 8)                            7 10.185 1.752e-12 ***
## ns(Incidents_HIV, df = 10)                                      
## ns(Schooling, df = 10)                                          
## s(Thinness_five_nine_years, df = 8)       7 15.146 < 2.2e-16 ***
## Region                                                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Final Model

# Final model fitted on training + validation (dftrval)
final_model <- gam(
  Life_expectancy ~
    s(Year, df = 2) +
    ns(Infant_deaths, df = 10) +
    ns(GDP_per_capita, df = 6) +
    s(BMI, df = 8) +
    ns(Incidents_HIV, df = 10) +
    ns(Schooling, df = 10) +
    s(Thinness_five_nine_years, df = 8) +
    Region,
  data = dftrval
)

summary(final_model)
## 
## Call: gam(formula = Life_expectancy ~ s(Year, df = 2) + ns(Infant_deaths, 
##     df = 10) + ns(GDP_per_capita, df = 6) + s(BMI, df = 8) + 
##     ns(Incidents_HIV, df = 10) + ns(Schooling, df = 10) + s(Thinness_five_nine_years, 
##     df = 8) + Region, data = dftrval)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max 
## -8.335666 -1.020961  0.002681  1.145005  7.848873 
## 
## (Dispersion Parameter for gaussian family taken to be 3.6013)
## 
##     Null Deviance: 189715 on 2149 degrees of freedom
## Residual Deviance: 7515.88 on 2087 degrees of freedom
## AIC: 8920.268 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                                       Df Sum Sq Mean Sq   F value    Pr(>F)    
## s(Year, df = 2)                        1   5884  5883.5 1633.7290 < 2.2e-16 ***
## ns(Infant_deaths, df = 10)            10 148042 14804.2 4110.7996 < 2.2e-16 ***
## ns(GDP_per_capita, df = 6)             6   1442   240.3   66.7147 < 2.2e-16 ***
## s(BMI, df = 8)                         1     34    34.4    9.5403  0.002037 ** 
## ns(Incidents_HIV, df = 10)            10  10504  1050.4  291.6793 < 2.2e-16 ***
## ns(Schooling, df = 10)                10    823    82.3   22.8659 < 2.2e-16 ***
## s(Thinness_five_nine_years, df = 8)    1      3     2.9    0.8126  0.367469    
## Region                                 8   1142   142.8   39.6547 < 2.2e-16 ***
## Residuals                           2087   7516     3.6                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                                     Npar Df Npar F     Pr(F)    
## (Intercept)                                                     
## s(Year, df = 2)                           1 17.490 3.007e-05 ***
## ns(Infant_deaths, df = 10)                                      
## ns(GDP_per_capita, df = 6)                                      
## s(BMI, df = 8)                            7 14.079 < 2.2e-16 ***
## ns(Incidents_HIV, df = 10)                                      
## ns(Schooling, df = 10)                                          
## s(Thinness_five_nine_years, df = 8)       7 20.543 < 2.2e-16 ***
## Region                                                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Final MSE

# Testing MSE

# Predict using the final model on the testing set
pred_test <- predict(final_model, newdata = dftest)

# Compute test MSE
MSE_test <- mean((dftest$Life_expectancy - pred_test)^2)
MSE_test
## [1] 3.740605
# Predicted vs Observed Plot
plot(
  dftest$Life_expectancy, pred_test,
  xlab = "Observed Life Expectancy",
  ylab = "Predicted Life Expectancy",
  main = "Predicted vs Observed Life Expectancy (Testing Set)",
  pch = 19, col = "darkgray"
)

# Add x=y reference line
abline(a = 0, b = 1, col = "red", lwd = 2)

Comment: The testing MSE for the final model was 3.740605, indicating strong predictive performance on unseen data.

The predicted vs. observed plot shows that the predictions from the final model align closely with the 45-degree reference line. Most points fall tightly around the line across the full range of life expectancy values, suggesting that the model captures the underlying relationship well. While some variability appears at the lower and upper extremes, the overall fit indicates that the model predicts life expectancy accurately for the test data.

Interpretation

Our goal was to build a model that accurately predicts life expectancy based on a country’s demographic, health, and economic characteristics. Using data-driven variable selection and spline-based modeling to capture nonlinear relationships, the final model performed well, achieving a low testing MSE of approximately 3.74.

Across all predictors considered, infant mortality and GDP per capita were among the strongest contributors to explaining life expectancy, showing clear nonlinear associations. Schooling, HIV incidence, BMI, and the thinness indicator also provided meaningful predictive information, while geographic region added important categorical structure. Measles incidence was the only variable that did not improve model performance and was removed during backward selection.

Overall, the model generalizes well and makes accurate predictions across the full range of life expectancy values, as shown by the close alignment between predicted and observed outcomes in the testing set. This suggests that the included variables collectively capture key population characteristics associated with national life expectancy. For an investigator, this model can serve as a reliable tool for predicting life expectancy and for understanding which factors most strongly influence these predictions.